Model serving patterns

This chapter covers
- Using model serving to generate predictions or make inferences on new data with previously trained machine learning models
- Handling model serving requests and achieving horizontal scaling with replicated model serving services
- Processing large model serving requests using the sharded services pattern
- Assessing model serving systems and event-driven design
In the previous chapter, we explored some of the challenges involved in the distributed training component, and I introduced a couple of practical patterns that can be incorporated into this component. Distributed training is the most critical part of a distributed machine learning system. For example, we've seen the challenges of training very large machine learning models that tag main themes in new YouTube videos but cannot fit on a single machine, and we looked at how we can overcome that difficulty by using the parameter server pattern. We also learned how to use the collective communication pattern to speed up distributed training for smaller models and avoid unnecessary communication overhead between parameter servers and workers. Last but not least, we talked about some of the vulnerabilities often seen in distributed machine learning systems due to corrupted datasets, unstable networks, and preempted worker machines and how we can address those problems.
Model serving is the next step after we have successfully trained a machine learning model. It is one of the essential steps in a distributed machine learning system. The model serving component needs to be scalable and reliable to handle the growing number of user requests and the growing size of individual requests. It’s also essential to know what tradeoffs we may see when making different design decisions to build a distributed model serving system.
In this chapter, we’ll explore some of the challenges involved in distributed model serving systems, and I’ll introduce a few established patterns adopted heavily in industry. For example, we’ll see challenges when handling the increasing number of model serving requests and how we can overcome these challenges to achieve horizontal scaling with the help of replicated services. We’ll also discuss how the sharded services pattern can help the system process large model serving requests. In addition, we’ll learn how to assess model serving systems and determine whether event-driven design would be beneficial in real-world scenarios.
4.1 What is model serving? Model serving is the process of loading a previously trained machine learning model to generate predictions or make inferences on new input data. It’s the step after we’ve successfully trained a machine learning model. Figure 4.1 shows where model serving fits in the machine learning pipeline.
Figure 4.1 A diagram showing where model serving fits in the machine learning pipeline
Ingest the data and train a machine learning model with the ingested data.
Model serving is the next step after we have successfully trained a machine learning model. We use the trained model to generate predictions or make inferences on new input data.
Note that model serving is a general concept that appears in distributed and traditional machine learning applications. In traditional machine learning applications, model serving is usually a single program that runs on a local desktop or machine and generates predictions on new datasets that are not used for model training. Both the dataset and the machine learning model used should be small enough to fit on a single machine for traditional model serving, and they are stored in the local disk of a single machine.
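To make the traditional case concrete, here is a minimal sketch of single-machine model serving, assuming a model previously saved with joblib and a hypothetical CSV file of new data; the file names and model type are illustrative, not prescribed by this chapter.

```python
import joblib        # loads a model previously saved with joblib.dump
import pandas as pd

# Load a previously trained model from the local disk of a single machine.
model = joblib.load("model.joblib")        # hypothetical file name

# Load new input data that was not used during model training.
new_data = pd.read_csv("new_videos.csv")   # hypothetical file name

# Model serving: generate predictions (inferences) on the new data.
predictions = model.predict(new_data)
print(predictions[:10])
```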
In contrast, distributed model serving usually happens in a cluster of machines. Both the dataset and the trained machine learning model used for model serving can be very large and must be stored in a remote distributed database or partitioned on the disks of multiple machines. The differences between traditional model serving and distributed model serving systems are summarized in table 4.1.
Table 4.1 Comparison between traditional model serving and distributed model serving systems
Computational resources: personal laptop or a single remote server (traditional) vs. a cluster of machines (distributed)
Dataset location: local disk on a single laptop or machine (traditional) vs. a remote distributed database or partitioned on the disks of multiple machines (distributed)
Size of model and dataset: small enough to fit on a single machine (traditional) vs. medium to large (distributed)
It’s nontrivial to build and manage a distributed model serving system that’s scalable, reliable, and efficient for different use cases. We will examine a couple of use cases as well as some established patterns that may address different challenges.
4.2 Replicated services pattern: Handling the growing number of serving requests As you may recall, in the previous chapter, we built a machine learning model to tag the main themes of new videos that the model hasn’t seen before using the YouTube-8M dataset (http://research.google.com/youtube8m/), which consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities such as Food, Car, Music, etc. A screenshot of what the videos in the YouTube-8M dataset look like is shown in Figure 4.2.
Now we would like to build a model serving system that allows users to upload new videos. Then, the system would load the previously trained machine learning model to tag entities/themes that appear in the uploaded videos. Note that the model serving system is stateless, so users’ requests won’t affect the model serving results.
The system basically takes the videos uploaded by users and sends requests to the model server. The model server then retrieves the previously trained entity-tagging machine learning model from the model storage to process the videos and eventually generate possible entities that appear in the videos. A high-level overview of the system is shown in figure 4.3.
Figure 4.2 A screenshot of what the videos in the YouTube-8M dataset look like. (Source: Sudheendra Vijayanarasimhan et al. Licensed under Nonexclusive License 1.0)
Figure 4.3 A high-level architecture diagram of the single-node model serving system
Users upload videos and then submit requests to the model serving system to tag the entities within the videos.
Note that this initial version of the model server only runs on a single machine and responds to model serving requests from users on a first-come, first-served basis, as shown in figure 4.4. This approach may work well if only very few users are testing the system. However, as the number of users or model serving requests increases, users will experience huge delays while waiting for the system to finish processing any previous requests. In the real world, this bad user experience would immediately lose our users’ interest in engaging with this system.
Figure 4.4 The model server only runs on a single machine and responds to model serving requests from users on a first-come, first-served basis.
User requests are processed on a first-come, first-served basis.
4.2.1 The problem The system takes the videos uploaded by users and then sends the requests to the model server. These model serving requests are queued and must wait to be processed by the model server.
Unfortunately, due to the nature of the single-node model server, it can only effectively serve a limited number of model serving requests on a first-come, first-served basis. As the number of requests grows in the real world, the user experience worsens when users must wait a long time to receive the model serving result. All requests are waiting to be processed by the model serving system, but the computational resources are bound to this single node. Is there a better way to handle model serving requests than sequentially?
4.2.2 The solution One fact we've neglected is that the existing model server is stateless, meaning that the model serving results for each request aren't affected by other requests and that the machine learning model processes each request on its own. In other words, the model server doesn't require a saved state to operate correctly.
Since the model server is stateless, we can add more server instances to help handle additional user requests without the requests interfering with each other, as shown in figure 4.5. These additional model server instances are exact copies of the original model server but with different server addresses, and each handles different model serving requests. In other words, they are replicated services for model serving or, in short, model server replicas.
Figure 4.5 Additional server instances help handle additional user requests without the requests interfering with each other.
Adding additional resources into our system with more machines is called horizontal scaling. Horizontal scaling systems handle more and more users or traffic by adding more replicas. The opposite of horizontal scaling is vertical scaling, which is usually implemented by adding computational resources to existing machines.
An analogy: Horizontal scaling vs. vertical scaling You can think of vertical scaling like retiring your sports car and buying a race car when you need more horsepower. While a race car is fast and looks amazing, it's also expensive and not very practical, and at the end of the day, it can only take you so far before running out of gas. In addition, there's only one seat, and the car must be driven on a flat surface, so it's really only suitable for racing. Horizontal scaling gets you that added horsepower not by trading up to an ever-faster car but by adding more vehicles to the mix. You can think of horizontal scaling as a fleet of vehicles that can carry a lot of passengers at once. Maybe none of these machines is a race car, but none of them needs to be: across the fleet, you have all the horsepower you need.
Let’s return to our original model serving system, which takes the videos uploaded by users and sends requests to the model server. Unlike our previous design of the model serving system, the system now has multiple model server replicas to process the model serving requests asynchronously. Each model server replica takes a single request, retrieves the previously trained entity-tagging machine learning model from model storage, and then processes the videos in the request to tag possible entities in the videos.
As a result, we've successfully scaled our model server horizontally by adding model server replicas to the existing model serving system. The new architecture is shown in figure 4.6. The model server replicas can handle many requests at a time since each replica processes individual model serving requests independently.
Figure 4.6 The system architecture after we've scaled our model server horizontally by adding model server replicas to the system
Users upload videos and then submit requests to the model serving system to tag the entities within the videos.
In the new architecture, multiple model serving requests from users are sent to the model server replicas at the same time. However, we haven’t discussed how they are being distributed and processed. For example, which request is being processed by which model server replica? In other words, we haven’t yet defined a clear mapping relationship between the requests and the model server replicas.
To do that, we can add another layer—namely, a load balancer, which handles the distribution of model serving requests among the replicas. For example, the load balancer takes multiple model serving requests from our users and then distributes the requests evenly to each of the model server replicas, which then are responsible for processing individual requests, including model retrieval and inference on the new data in the request. Figure 4.7 illustrates this process.
The load balancer uses different algorithms to decide which request goes to which model server replica. Example algorithms for load balancing include round robin, the least connection method, hashing, etc.
The replicated services pattern provides a great way to scale our model serving system horizontally. It can also be generalized for any systems that serve a large amount of traffic. Whenever a single instance cannot handle the traffic, introducing this pattern ensures that all traffic can be handled equivalently and efficiently. We’ll apply this pattern in section 9.3.2.
Figure 4.7 A diagram showing how a load balancer is used to distribute the requests evenly across model server replicas
Multiple model serving requests from users
The load balancer distributes the requests evenly to each of the model server replicas.
Round robin for load balancing Round robin is a simple technique in which the load balancer forwards each request to a different server replica based on a rotating list.
Even though it's easy to implement a load balancer with the round-robin algorithm, all traffic still flows through the load balancer server itself, so it might be dangerous if the load balancer receives a lot of requests that require expensive processing: it may become overloaded past the point where it can effectively do its job.
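As an illustration of the round-robin idea, here is a minimal Python sketch of a load balancer that rotates requests across model server replicas; the replica addresses and the /predict endpoint are hypothetical, and a production load balancer would add health checks, retries, and timeouts.

```python
import itertools
import requests  # third-party HTTP client

# Hypothetical addresses of identical model server replicas.
REPLICAS = [
    "http://model-server-0:8501",
    "http://model-server-1:8501",
    "http://model-server-2:8501",
]

# Rotate through the replicas: request 1 goes to replica 0, request 2 to
# replica 1, request 3 to replica 2, request 4 back to replica 0, and so on.
replica_cycle = itertools.cycle(REPLICAS)

def serve(request_payload: dict) -> dict:
    """Forward one model serving request to the next replica in the rotation."""
    replica = next(replica_cycle)
    response = requests.post(f"{replica}/predict", json=request_payload, timeout=30)
    return response.json()
```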
4.2.3 Discussion Now that we have load-balanced model server replicas in place, we should be able to support the growing number of user requests, and the entire model serving system achieves horizontal scaling. Not only can we handle model serving requests in a scalable way, but the overall model serving system also becomes highly available (https://mng.bz/EQBd). High availability is a characteristic of a system that maintains an agreed-on level of operational performance, usually uptime, for a longer-than-normal period. It's often expressed as a percentage of uptime in a given year.
For example, some organizations may require services to meet a highly available service-level agreement, which means the service is up and running 99.9% of the time (known as three-nines availability). In other words, the service can only have about 1.4 minutes of downtime per day (24 hours × 60 minutes × 0.1%). With the help of replicated model services, if any of the model server replicas crashes or gets preempted on a spot instance, the remaining model server replicas are still available and ready to process any incoming model serving requests from users, which provides a good user experience and makes the system reliable.
In addition, since our model server replicas need to retrieve previously trained machine learning models from remote model storage, they need to be ready in addition to being alive. It's important to build and deploy readiness probes that inform the load balancer when a replica has successfully established its connection to the remote model storage and is ready to serve model serving requests from users. A readiness probe helps the system determine whether a particular replica is ready to serve. With readiness probes, users do not experience unexpected behaviors when the system is not ready due to internal system problems.
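As a rough sketch of what such probes could look like, the following standard-library HTTP server exposes a liveness endpoint and a readiness endpoint; the /healthz and /ready paths and the connected_to_model_storage check are assumptions for illustration, and real deployments would wire these endpoints into whatever probing mechanism the serving platform provides.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def connected_to_model_storage() -> bool:
    """Hypothetical check: return True once this replica has established its
    connection to the remote model storage and loaded the trained model."""
    ...  # e.g., ping the storage endpoint or verify the model is in memory
    return True

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the server process is up and able to respond at all.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: report 200 only when the model storage connection is
            # established, so the load balancer starts routing traffic here.
            self.send_response(200 if connected_to_model_storage() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ProbeHandler).serve_forever()
```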
The replicated services pattern addresses the horizontal scalability problem that prevents our model serving system from supporting a large number of model serving requests. However, in real-world model serving systems, not only does the number of serving requests increase, but the size of each request also grows and can get extremely large if the data or the payload is large. In that case, replicated services may not be able to handle the large requests. We will talk about that scenario and introduce a pattern that can alleviate the problem in the next section.
4.2.4 Exercises
1 Are replicated model servers stateless or stateful?
2 What happens when we don't have a load balancer as part of the model serving system?
3 Can we achieve three-nines service-level agreements with only one model server instance?
4.3 Sharded services pattern The replicated services pattern efficiently resolves our horizontal scalability problem so that our model serving system can support a growing number of user requests. We achieve the additional benefit of high availability with the help of model server replicas and a load balancer.
NOTE Each model server replica has a limited and pre-allocated amount of computational resources. More important, the amount of computational resources for each replica must be identical for the load balancer to distribute requests correctly and evenly.
Next, let's imagine that a user wants to upload a high-resolution YouTube video that needs to be tagged with entities using the model server application. Even though the high-resolution video is very large, it may still be uploaded successfully to a model server replica as long as the replica has sufficient disk storage. However, none of the individual model server replicas can process the request, since processing this single large request requires more memory than has been allocated to the replica. This need for a large amount of memory is often due to the complexity of the trained machine learning model, as it may contain a lot of expensive matrix computations or mathematical operations, as we've seen in the previous chapter.
For instance, a user uploads a high-resolution video to the model serving system through a large request. One of the model server replicas takes this request and successfully retrieves the previously trained machine learning model. Unfortunately, the model then fails to process the large data in the request since the model server replica that’s responsible for processing this request does not have sufficient memory. Eventually, we may notify the user of this failure after they have waited a long time, which results in a bad user experience. A diagram for this situation is shown in figure 4.8.
Figure 4.8 A diagram showing that the model fails to process the large data in the request since the model server replica responsible for processing this request does not have sufficient memory
A user uploads a high-resolution video to the model serving system. The request fails because the model server replica processing this large request does not have enough computational resources.
4.3.1 The problem: Processing large model serving requests with high-resolution videos The requests the system is serving are large since the videos users upload are high resolution. In cases where the previously trained machine learning model contains expensive mathematical operations, these large video requests cannot be successfully processed and served by individual model server replicas with a limited amount of memory. How do we design the model serving system to handle large requests of high-resolution videos successfully?
4.3.2 The solution Given our requirement for the computational resources on each model server replica, can we scale vertically by increasing each replica’s computational resources so it can handle large requests like high-resolution videos? Since we are vertically scaling all the replicas by the same amount, we will not affect our load balancer’s work.
Unfortunately, we cannot simply scale the model server replicas vertically since we don’t know how many such large requests there are. Imagine only a couple of users have high-resolution videos needing to be processed (e.g., professional photographers who have high-end cameras that capture high-resolution videos), and the remaining vast majority of the users only upload videos from their smartphones with much smaller resolutions. As a result, most of the added computational resources on the model server replicas are idling, which results in very low resource utilization. We will examine the resource utilization perspective in the next section, but for now, we know that this approach is not practical.
Remember we introduced the parameter server pattern in chapter 3, which allows us to partition a very large model? Figure 4.9 is the diagram we discussed in chapter 3 that shows distributed model training with multiple parameter servers; the large model has been partitioned, and each partition is located on different parameter servers. Each worker node takes a subset of the dataset, performs calculations required in each neural network layer, and then sends the calculated gradients to update one model partition stored in one of the parameter servers.
Figure 4.9 Distributed model training with multiple parameter servers where the large model has been sharded and each partition is located on different parameter servers
To deal with our problem of large model serving requests, we can borrow the same idea and apply it to our particular scenario.
We first divide the original high-resolution video into multiple smaller videos, and then each video is processed independently by a different model server shard. The model server shards are partitions of a single model server instance, and each is responsible for processing a subset of a large request.
The diagram in figure 4.10 is an example architecture of the sharded services pattern. In the diagram, a high-resolution video that contains a dog and a kid gets divided into two separate videos where each of the videos represents a subset of the original large request. One of the separated videos contains the part where the dog appears, and the other video contains the part where the kid appears. These two separated videos become two separate requests and are processed by different model server shards independently.
Figure 4.10 An example architecture of the sharded services pattern where a high-resolution video gets divided into two separate videos. Each video represents a subset of the original large request and is processed by a different model server shard independently.
The high-resolution video is divided into two separate videos and sent to each of the model server shards.
After the model server shards receive the sub-requests where each contains part of the original large model serving request, each model server shard then retrieves the previously trained entity-tagging machine learning model from model storage and then processes the videos in the request to tag possible entities that appear in the videos, similar to the previous model serving system we’ve designed. Once all the sub-requests have been processed by each of the model server shards, we merge the model inference result from two sub-requests—namely, the two entities, dog and kid—to obtain a result for the original large model serving request with the high-resolution video.
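The scatter/gather flow just described can be sketched as follows; split_video and tag_entities_on_shard are hypothetical helpers standing in for the actual request splitting and per-shard inference calls, and the merge step is a simple set union of the tagged entities.

```python
from concurrent.futures import ThreadPoolExecutor

def split_video(video, num_chunks):
    """Hypothetical helper: divide one large video (here, a list of frames or
    segments) into smaller pieces, one per sub-request."""
    chunk_size = max(1, len(video) // num_chunks)
    return [video[i:i + chunk_size] for i in range(0, len(video), chunk_size)]

def tag_entities_on_shard(shard_id, video_chunk):
    """Hypothetical call: send one sub-request to a model server shard and
    return the set of entities it tags, e.g., {"dog"} or {"kid"}."""
    ...  # e.g., POST the chunk to the shard's serving endpoint
    return set()

def serve_large_request(video, num_shards=2):
    chunks = split_video(video, num_shards)
    # Scatter: each sub-request is processed independently by a different shard.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partial_results = pool.map(tag_entities_on_shard, range(len(chunks)), chunks)
    # Gather: merge the per-shard results into the final response, e.g., {"dog", "kid"}.
    return set().union(*partial_results)
```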
How do we distribute the two sub-requests to different model server shards? Similar to the algorithms we use to implement the load balancer, we can use a sharding function, which is very similar to a hashing function, to determine which shard in the list of model server shards should be responsible for processing each sub-request.
Usually, the sharding function is defined using a hashing function and the modulo (%) operator. For example, hash(request) % 10 maps each request to one of 10 shards, even when the outputs of the hash function are significantly larger than the number of shards in the sharded service.
Characteristics of hashing functions for sharding The hashing function that defines the sharding function transforms an arbitrary object into an integer representing a particular shard index. It has two important characteristics:
1 The output from hashing is always the same for a given input.
2 The distribution of outputs is always uniform within the output space.
These characteristics are important and can ensure that a particular request will always be processed by the same shard server and that the requests are evenly distributed among the shards.
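A small sketch of such a sharding function is shown below; it uses a stable cryptographic hash because Python's built-in hash() is randomized between processes, which would break the first characteristic above.

```python
import hashlib

def shard_for(request_key: str, num_shards: int) -> int:
    """Map a request key to a shard index in the range [0, num_shards)."""
    # SHA-256 is deterministic across processes and spreads keys uniformly,
    # so the same key always lands on the same shard.
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always maps to the same shard; different keys spread out evenly.
print(shard_for("video-123", num_shards=10))  # a stable value between 0 and 9
print(shard_for("video-456", num_shards=10))
```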
The sharded services pattern solves the problem we encounter when building model serving systems at scale and provides a great way to handle large model serving requests. It’s similar to the data-sharding pattern we introduced in chapter 2: instead of applying sharding to datasets, we apply sharding to model serving requests. When a distributed system has limited computational resources for a single machine, we can apply this pattern to offload the computational burden to multiple machines.
4.3.3 Discussion The sharded services pattern helps handle large requests and efficiently distributes the workload of processing large model serving requests to multiple model server shards. It’s generally useful when considering any sort of service where the data exceeds what can fit on a single machine.
However, unlike the replicated services pattern we discussed in the previous section, which is useful when building stateless services, the sharded services pattern is generally used for building stateful services. In our case, we need to maintain the state or the results from serving the sub-requests from the original large request using sharded services and then merge the results into the final response so it includes all entities from the original high-resolution video.
In some cases, this approach may not work well because it depends on how we divide the original large request into smaller requests. For example, if the original video has been divided into more than two sub-requests, some may not be meaningful since they don’t contain any complete entities that are recognizable by the machine learning model we’ve trained. For situations like that, we need additional handling and cleaning of the merged result to remove meaningless entities that are not useful to our application.
Both the replicated services pattern and sharded services pattern are valuable when building a model serving system at scale to handle a great number of large model serving requests. However, to incorporate them into the model serving system, we need to know the required computational resources at hand, which may not be available if the traffic is rather dynamic. In the next section, I will introduce another pattern focusing on model serving systems that can handle dynamic traffic.
4.3.4 Exercises
1 Would vertical scaling be helpful when handling large requests?
2 Are the model server shards stateful or stateless?
4.4 The event-driven processing pattern The replicated services pattern we examined in section 4.2 helps handle a large number of model serving requests, and the sharded services pattern in section 4.3 can be used to process very large requests that may not fit in a single model server instance. While these patterns address the challenges of building model serving systems at scale, they are more suitable when the system knows how many computational resources, model server replicas, or model server shards to allocate before it starts taking user requests. However, for cases in which we do not know how much model serving traffic the system will receive, it's hard to allocate and use resources efficiently.
Now imagine that we work for a company that provides holiday and event planning services to subscribed customers. We’d like to provide a new service that will use a trained machine learning model to predict hotel prices per night for the hotels located in resort areas, given a range of dates and a specific location where our customers would like to spend their holidays.
To provide that service, we can design a machine learning model serving system. This model serving system provides a user interface where users can enter the range of dates and locations they are interested in staying for holidays. Once the requests are sent to the model server, the previously trained machine learning model will be retrieved from the distributed database and process the data in the requests (dates and locations). Eventually, the model server will return the predicted hotel prices for each location within the given date range. The complete process is shown in figure 4.11.
After testing this model serving system on selected customers for one year, we have collected sufficient data to plot the model serving traffic over time. As it turns out, people prefer to book their holidays at the last moment, so traffic increases abruptly shortly before holidays and then decreases again after the holiday periods. The problem with this traffic pattern is that it leads to a very low resource utilization rate.
In the current architecture of our model serving system, the underlying computational resources allocated to the model remain unchanged at all times. This strategy is far from optimal: during periods of low traffic, most of our resources are idling and thus wasted, whereas during periods of high traffic, our system struggles to respond in a timely fashion and requires more resources than normal to operate. In other words, the system has to deal with either high or low traffic with the same amount of computational resources (e.g., 10 CPUs and 100 GB of memory), as shown in figure 4.12.
Figure 4.11 A diagram of the model serving system to predict hotel prices
Users enter a date range and location and then submit requests to the serving system.
Figure 4.12 The traffic changes of the model serving system over time with an equal amount of computational resources allocated all the time.
Peak traffic arrives about one week before Christmas (10 CPUs and 100 GB of memory).
The traffic decreases dramatically during Christmas (10 CPUs and 100 GB of memory).
There is very little traffic two weeks after Christmas (10 CPUs and 100 GB of memory).
Since we know, more or less, when those holiday periods are, why don't we plan accordingly? Unfortunately, some events make it hard to predict surges in traffic. For example, a huge international conference may be planned near one of the resorts, as shown in figure 4.13. This unexpected event, which happens before Christmas, suddenly adds traffic during that particular time window (solid line). Not knowing about the conference, we would miss a window that should be taken into account when allocating computational resources. Specifically, in our scenario, the 2 CPUs and 20 GB of memory that were optimized for this window's usual traffic are no longer sufficient to handle all the requests within the window, and the user experience would be very bad. Imagine all the conference attendees sitting in front of their laptops, waiting a long time to book a hotel room.
Figure 4.13 The traffic of our model serving system over time with an optimal amount of computational resources allocated for different time windows. In addition, an unexpected event happened before Christmas that suddenly added traffic during that particular time window (solid line).
2 CPUs and 20 GB of memory. Resource utilization rate: high
10 CPUs and 100 GB of memory. Resource utilization rate: high
Huge international conference
2 CPUs and 20 GB of memory. Resource utilization rate: high but unable to handle all requests within this time window
20 CPUs and 200 GB of memory. Resource utilization rate: high
1 CPU and 10 GB of memory. Resource utilization rate: high
In other words, this naive solution is still not very practical and effective since it’s nontrivial to figure out the time windows to allocate different amounts of resources and how much additional resources are needed for each time window. Can we come up with any better approach?
In our scenario, we are dealing with a dynamic number of model serving requests that varies over time and is highly correlated to times around holidays. What if we can guarantee we have enough resources and forget about our goal of increasing the resource utilization rate for now? If the computational resources are guaranteed to be more than sufficient at all times, we can make sure that the model serving system can handle heavy traffic during holiday seasons.
4.4.1 The problem: Responding to model serving requests based on events The naive approach, which is to identify beforehand any possible time windows in which the system might experience a high volume of traffic and then estimate and allocate computational resources accordingly, is not feasible. It's not easy to determine the exact dates of the high-traffic time windows or the exact amount of computational resources needed during each.
Simply increasing the computational resources to an amount sufficient at all times also is not practical, as the resource utilization rate we were concerned about earlier remains low. For example, if nearly no user requests are made during a particular time period, the computational resources we have allocated are, unfortunately, mostly idling and thus wasted. Is there another approach that allocates and uses computational resources more wisely?
4.4.2 The solution The solution to our problem is maintaining a pool of computational resources (e.g., CPUs, memory, disk, etc.) allocated not only to this particular model serving system but also to model serving of other applications or other components of the distributed machine learning pipeline.
Figure 4.14 is an example architecture diagram where a shared resource pool is used by different systems—for example, data ingestion, model training, model selection, model deployment, and model serving—at the same time. This shared resource pool gives us enough resources to handle peak traffic for the model serving system by pre-allocating resources required during historical peak traffic and autoscaling when the limit is reached. Therefore, we only use resources when needed and only the specific amount of resources required for each particular model serving request.
For our discussions, I only focus on the model serving system in the diagram, and details for other systems are neglected here. In addition, here I assume that the model training component only utilizes similar types of resources, such as CPUs. If the model training component requires GPUs or a mix of CPUs/GPUs, it may be better to use a separate resource pool, depending on specific use cases.
When the users of our hotel price prediction application enter into the UI the range of dates and locations that they are interested in staying for holidays, the model serving requests are sent to the model serving system. Upon receiving each request, the system notifies the shared resource pool that certain amounts of computational resources are being used by the system.
For example, figure 4.15 shows the traffic of our model serving system over time with an unexpected bump. The unexpected bump is due to a very large new international conference that happens before Christmas. This event suddenly adds traffic, but the model serving system successfully handles the surge by borrowing a necessary amount of resources from the shared resource pool. With the help of the shared resource pool, the resource utilization rate remains high during this unexpected event. The shared resource pool monitors the current amount of available resources and autoscales when needed.
Figure 4.14 An architecture diagram in which a shared resource pool is being used by different components—for example, data ingestion, model training, model selection, and model deployment—and two different model serving systems at the same time. The arrows with solid lines indicate resources, and the arrows with dashed lines indicate requests.
Figure 4.15 The traffic of our model serving system over time. An unexpected bump happened before Christmas that suddenly added traffic. The jump in requests is handled successfully by the model serving system by borrowing the necessary amount of resources from the shared resource pool. The resource utilization rate remains high during this unexpected event.
2 CPUs and 20 GB of memory. Resource utilization rate: high
10 CPUs and 100 GB of memory. Resource utilization rate: high
Huge international conference
"Borrow" the necessary resources from the resource pool. Resource utilization rate: high
20 CPUs and 200 GB of memory. Resource utilization rate: high
1 CPU and 10 GB of memory. Resource utilization rate: high
This approach, in which the system listens to the user requests and only responds and utilizes the computational resources when the user request is being made, is called event-driven processing.
Event-driven processing vs. long-running serving systems Event-driven processing is different from the model serving systems that we've looked at in previous sections (e.g., systems using the replicated services [section 4.2] and sharded services [section 4.3] patterns), where the servers that handle user requests are always up and running. Those long-running serving systems work well for many applications that are under heavy load, keep a large amount of data in memory, or require some sort of background processing. However, for applications that handle very few requests during nonpeak periods or respond to specific events, such as our hotel price prediction system, the event-driven processing pattern is more suitable. This event-driven processing pattern has flourished in recent years as cloud providers have developed function-as-a-service products.
In our scenario, each model serving request made from our hotel price prediction system represents an event. Our serving system listens for this type of event, utilizes necessary resources from the shared resource pool, and retrieves and loads the trained machine learning model from the distributed database to estimate the hotel prices for the specified time/location query. Figure 4.16 is a diagram of this event-driven model serving system.
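A minimal sketch of such an event handler is shown below, in the style of a function-as-a-service entry point; the event fields, the load_model helper, and the placeholder price model are all hypothetical, and the actual trigger wiring depends on the platform managing the shared resource pool.

```python
import functools

@functools.lru_cache(maxsize=1)
def load_model():
    """Hypothetical loader: fetch the trained hotel price model from the
    distributed database. Cached so a warm function instance does not
    reload the model for every event."""
    ...
    return lambda dates, location: {"avg_price_per_night": 250.0}  # placeholder

def handle_event(event: dict) -> dict:
    """Entry point invoked once per model serving request (the 'event').

    The event is assumed to look like:
        {"dates": ["2023-12-20", "2023-12-27"], "location": "Lake Tahoe"}
    The platform creates or removes function instances as the event rate
    changes, drawing on the shared resource pool only when needed.
    """
    model = load_model()
    return model(event["dates"], event["location"])
```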
Using this event-driven processing pattern for our serving system, we can make sure that our system is using only the resources necessary to process every request without concerning ourselves with resource utilization and idling. As a result, the system has sufficient resources to deal with peak traffic and return the predicted prices without users experiencing noticeable delays or lags when using the system.
Even though we now have a shared pool with sufficient computational resources that we can borrow from on demand to handle user requests, we should also build a mechanism in our model serving system to defend against denial-of-service attacks. A denial-of-service attack interrupts an authorized user's access to a computer network, is typically carried out with malicious intent, and is often seen in model serving systems. These attacks can cause unexpected use of computational resources from the shared resource pool, which may eventually lead to resource scarcity for other services that rely on the shared resource pool.
Denial-of-service attacks may happen in various cases. For example, they may come from users who accidentally send a huge amount of model serving requests in a very short period of time. Developers may have misconfigured a client that uses our model serving APIs, so it sends requests constantly or accidentally kicks off an unexpected load/stress test in a production environment.
Figure 4.16 A diagram of the event-driven model serving system to predict hotel prices
Users enter a date range and location and then submit requests to the serving system.
To deal with these situations, which often happen in real-world applications, it makes sense to introduce a defense mechanism against denial-of-service attacks. One approach is rate limiting, which adds the model serving requests to a queue and limits the rate at which the system processes the requests in the queue.
Figure 4.17 is a flowchart showing four model serving requests sent to the model serving system. However, only two are under the current rate limit, which allows a maximum of two concurrent model serving requests. In this case, the rate-limiting queue for model serving requests first checks whether the requests received are under the current rate limit. Once the system has finished processing those two requests, it will proceed to the remaining two requests in the queue.
If we are deploying and exposing an API for a model serving service to our users, it’s also generally a best practice to have a relatively small rate limit (e.g., only one request is allowed within 1 hour) for users with anonymous access and then ask users to log in to obtain a higher rate limit. This system would allow the model serving system to better control and monitor the users’ behavior and traffic so that we can take necessary actions to address any potential problems or denial-of-service attacks. For example, requiring a login provides auditing to find out which users/events are responsible for the unexpectedly large number of model serving requests.
Figure 4.17 A flowchart of four model serving requests sent to the model serving system. However, only two are under the current rate limit, which allows a maximum of two concurrent model serving requests. Once the system has finished processing those two requests, it will proceed to the remaining two requests in the queue.
Ok to add to the queue?
Model serving requests that are under rate limit: 2 maximum concurrent requests
Figure 4.18 demonstrates the previously described strategy. In the diagram, the flowchart on the left side is the same as figure 4.17 where four total model serving requests from unauthenticated users are sent to the model serving system. However, only two can be served by the system due to the current rate limit, which allows a maximum of two concurrent model serving requests for unauthenticated users. Conversely, the model serving requests in the flowchart on the right side all come from authenticated users. Thus, three requests can be processed by the model serving system since the limit of maximum concurrent requests for authenticated users is three.
Figure 4.18 A comparison of behaviors from different rate limits applied to authenticated and unauthenticated users
Ok to add to the queue?
Model serving requests that are under rate limit: 2 maximum concurrent requests for unauthenticated users
Ok to add to the queue?
Model serving requests that are under rate limit: 3 maximum concurrent requests for authenticated users
Rate limits differ depending on whether the user is authenticated. Rate limits thus effectively control the traffic of the model serving system and prevent malicious denial-of-service attacks, which could cause unexpected use of computational resources from the shared resource pool and eventually lead to resource scarcity of other services that rely on it.
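As a rough sketch of the tiered rate limiting just described, the following snippet caps the number of concurrent requests per user tier with semaphores (2 for unauthenticated users and 3 for authenticated users, matching figures 4.17 and 4.18); time-based limits such as one request per hour for anonymous users would need an additional counter keyed by user and time window.

```python
import threading

# Maximum number of concurrent model serving requests per user tier.
LIMITS = {
    "unauthenticated": threading.Semaphore(2),
    "authenticated": threading.Semaphore(3),
}

def serve_with_rate_limit(request, user_tier, process_fn):
    """Queue the request until a slot for its tier frees up, then process it."""
    with LIMITS[user_tier]:         # blocks if all slots for this tier are busy
        return process_fn(request)  # e.g., run model inference on the request
```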
4.4.3 Discussion Even though we’ve seen how the event-driven processing pattern benefits our particular serving system, we should not attempt to use this pattern as a universal solution. The use of many tools and patterns can help you develop a distributed system to meet unique real-world requirements.
For machine learning applications with consistent traffic (for example, model predictions calculated regularly on a schedule), an event-driven processing approach is unnecessary: the system already knows when to process the requests, and trying to monitor this regular traffic would add too much overhead. In addition, applications that can tolerate less-accurate predictions can work well without being driven by events; they can recalculate and provide good-enough predictions at a particular granularity, such as per day or per week.
Event-driven processing is more suitable for applications with varying traffic patterns that make it complicated for the system to prepare the necessary computational resources beforehand. With event-driven processing, the model serving system only requests a necessary amount of computational resources on demand. The applications can also provide more accurate and real-time predictions since they obtain the predictions right after the users send requests instead of relying on precalculated prediction results based on a schedule.
From the developers' perspective, one benefit of the event-driven processing pattern is that it's very intuitive. For example, it greatly simplifies the process of deploying code to running services since there is no end artifact to create or push beyond the source code itself. The event-driven processing pattern makes it simple to deploy code from our laptops or web browser and run it in the cloud.
In our scenario, we only need to deploy the trained machine learning model that may be used as a function to be triggered based on user requests. Once deployed, this model serving function is then managed and scaled automatically without the need to allocate resources manually by developers. In other words, as more traffic is loaded onto the service, more instances of the model serving function are created to handle the increase in traffic using the shared resource pool. If the model serving function fails due to machine failures, it will be restarted automatically on other machines in the shared resource pool.
Given the nature of the event-driven processing pattern, each function that processes model serving requests needs to be stateless and independent of other model serving requests. A function instance cannot rely on local memory between invocations, which requires all state to be stored in a storage service. For example, if our machine learning models depend heavily on the results from previous predictions (e.g., a time-series model), the event-driven processing pattern may not be suitable.
4.4.4 Exercises
1 Suppose we allocate the same amount of computational resources over the lifetime of the model serving system for hotel price prediction. What would the resource utilization rate look like over time?
2 Are the replicated services or sharded services long-running systems?
3 Is event-driven processing stateless or stateful?
4.5 Answers to exercises
Section 4.2
1 Stateless
2 The model server replicas would not know which requests from users to process, and there would be potential conflicts or duplicate work when multiple model server replicas try to process the same requests.
3 Yes, but only if the single server has no more than 1.4 minutes of downtime per day.
Section 4.3
1 Yes, it helps, but it would decrease the overall resource utilization.
2 Stateful
Section 4.4
1 It varies over time depending on the traffic.
2 Yes. The servers must be kept running to accept user requests, and computational resources need to be allocated and occupied at all times.
3 Stateless
Summary
- Model serving is the process of loading a previously trained machine learning model to generate predictions or make inferences on new input data.
- The replicated services pattern helps handle the growing number of model serving requests and achieves horizontal scaling for the model serving system.
- The sharded services pattern allows the system to handle large requests, efficiently distributing the workload of processing large model serving requests across multiple model server shards.
- With the event-driven processing pattern, we can ensure that our system only uses the resources necessary to process each request without worrying about resource utilization and idling.