This chapter covers

- Using workflows to connect machine learning system components
- Composing complex but maintainable structures within machine learning workflows with the fan-in and fan-out patterns
- Accelerating machine learning workloads with concurrent steps using synchronous and asynchronous patterns
- Improving performance with the step memoization pattern

Model serving is a critical step after successfully training a machine learning model. It is the final step of the entire machine learning workflow, and its results are presented to users directly. Previously, we explored some of the challenges involved in distributed model serving systems—for example, how to handle the growing number of model serving requests and the increased size of those requests—and investigated a few established patterns heavily adopted in industry. We learned how to achieve horizontal scaling with the help of replicated services to address these challenges and how the sharded services pattern can help the system process large model serving requests. Finally, we learned how to assess model serving systems and determine whether an event-driven design would be beneficial in real-world scenarios.

Workflow is an essential component in machine learning systems, as it connects all other components in the system. A machine learning workflow can be as easy as chaining data ingestion, model training, and model serving. However, it can also become very complex when real-world scenarios require additional steps and performance optimizations as part of the entire workflow. It’s essential to know what tradeoffs we may face when making design decisions to meet different business and performance requirements.

In this chapter, we’ll explore some of the challenges involved in building machine learning workflows in practice and study several established patterns that address them. Each of these patterns can be reused to build simple to complex machine learning workflows that are efficient and scalable. For example, we’ll see how to build a system that executes complex machine learning workflows to train multiple machine learning models. We will use the fan-in and fan-out patterns to select the most performant models that provide good entity-tagging results in the model serving system. We’ll also incorporate synchronous and asynchronous patterns to make machine learning workflows more efficient and avoid delays caused by long-running model training steps that block subsequent steps.

5.1 What is workflow?

Workflow is the process of connecting multiple components or steps in an end-to-end machine learning system. A workflow consists of arbitrary combinations of the components commonly seen in real-world machine learning applications, such as data ingestion, distributed model training, and model serving, as discussed in the previous chapters. Figure 5.1 shows a simple machine learning workflow that connects multiple components or steps in an end-to-end machine learning system and includes the following steps:

1. Data ingestion—Consumes the YouTube-8M videos dataset
2. Model training—Trains an entity-tagging model
3. Model serving—Tags entities in unseen videos

NOTE A machine learning workflow is often referred to as a machine learning pipeline. I use these two terms interchangeably; although they may refer to different technologies elsewhere, there is no difference between them in this book.

Since a machine learning workflow may consist of any combination of these components, we often see machine learning workflows take different forms in different situations. Unlike the straightforward workflow shown in figure 5.1, figure 5.2 illustrates a more complicated workflow where two separate model training steps are launched after a single data ingestion step, and then two separate model serving steps are used to serve the different models trained by those model training steps.

Figure 5.1 A diagram showing a simple machine learning workflow, including data ingestion, model training, and model serving. The arrows indicate the order of step execution (e.g., the workflow executes the model serving step after the model training step is completed).
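
To make this concrete, the following is a minimal Python sketch of the workflow in figure 5.1. The three step functions are hypothetical placeholders rather than real data ingestion, training, or serving code; the point is only to show how the steps chain together sequentially.

```python
# A minimal sketch of the workflow in figure 5.1: three steps chained
# sequentially, where each step's output feeds the next step.

def ingest_data():
    """Consume the YouTube-8M videos dataset and return a prepared dataset."""
    return {"videos": ["video-1", "video-2"]}  # placeholder dataset

def train_model(dataset):
    """Train an entity-tagging model on the ingested dataset."""
    return {"model": "entity-tagger", "trained_on": len(dataset["videos"])}

def serve_model(model, new_videos):
    """Tag entities in unseen videos with the trained model."""
    return {video: ["placeholder-entity"] for video in new_videos}

# The workflow executes the steps in order: each step starts only after
# the previous step has completed.
dataset = ingest_data()
model = train_model(dataset)
predictions = serve_model(model, ["unseen-video"])
print(predictions)
```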

Figure 5.2 A more complicated workflow, where two separate model training steps are launched after a single data ingestion step, and then two separate model serving steps are used to serve different models trained via different model training steps

Figures 5.1 and 5.2 are just some common examples. In practice, the complexity of machine learning workflows varies, which increases the difficulty of building and maintaining scalable machine learning systems. We will discuss some of the more complex machine learning workflows in this chapter, but to start, I’ll introduce and distinguish between two concepts: the sequential workflow and the directed acyclic graph (DAG). A sequential workflow represents a series of steps performed one after another until the last step in the series is complete. The exact order of execution varies, but the steps always run sequentially. Figure 5.3 shows an example sequential workflow with three steps executed in order.

Figure 5.3 An example sequential workflow with three steps that execute in the following order: A, B, and C.

A workflow can be seen as a DAG if it consists only of steps that are directed from one step to another and never form a closed loop. For example, the workflow in figure 5.3 is a valid DAG since the three steps are directed from step A to step B and then from step B to step C—no loop is closed. The workflow shown in figure 5.4, however, is not a valid DAG since an additional step D connects from step C and points back to step A, which forms a closed loop. If step D does not point back to step A, as shown in figure 5.5, where the arrow is crossed out, the workflow becomes a valid DAG. The loop is no longer closed, and the workflow reduces to a simple sequential workflow, similar to figure 5.3. In real-world machine learning applications, the workflows necessary to meet the requirements of different use cases (e.g., batch retraining of the models, hyperparameter tuning experiments, etc.) can get really complicated. We will go through some more complex workflows and abstract the structural patterns that can be reused to compose workflows for various scenarios.
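
As a concrete illustration, here is a minimal Python sketch of how a workflow engine might check that a workflow is a valid DAG. The workflow representation (a mapping from each step to the steps it points to) is an assumption made for illustration; the check itself is Kahn's topological-sort algorithm.

```python
# A workflow is represented as a mapping from each step to the steps it
# points to. Kahn's algorithm removes steps with no remaining incoming
# edges; it can visit every step only if no closed loop exists.
from collections import deque

def is_valid_dag(workflow):
    # Count incoming edges for every step.
    indegree = {step: 0 for step in workflow}
    for targets in workflow.values():
        for target in targets:
            indegree[target] += 1

    # Repeatedly remove steps that have no remaining incoming edges.
    ready = deque(step for step, count in indegree.items() if count == 0)
    visited = 0
    while ready:
        step = ready.popleft()
        visited += 1
        for target in workflow[step]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    # If a closed loop exists, some steps are never removed.
    return visited == len(workflow)

print(is_valid_dag({"A": ["B"], "B": ["C"], "C": []}))                 # figure 5.3: True
print(is_valid_dag({"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}))  # figure 5.4: False
```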

Figure 5.4 An example workflow where step D connects from step C and points to step A. These connections form a closed loop and thus the entire workflow is not a valid DAG.

Figure 5.5 An example workflow where the last step D does not point back to step A. This workflow is now a valid DAG since the closed loop no longer exists. In fact, it is a simple sequential workflow similar to figure 5.3.

5.2 Fan-in and fan-out patterns: Composing complex machine learning workflows

In chapter 3, we built a machine learning model to tag the main themes of new videos that the model hadn’t seen before using the YouTube-8M dataset. The YouTube-8M dataset consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities such as Food, Car, Music, etc. In chapter 4, we also discussed patterns that help build scalable model serving systems where users can upload new videos, and the system then loads the previously trained machine learning model to tag entities/themes that appear in the uploaded videos.

In real-world applications, we often want to chain these steps together and package them in a way that can be easily reused and distributed. For example, what if the original YouTube-8M dataset has been updated, and we’d like to train a new model from scratch using the same model architecture? In this case, it’s fairly easy to containerize each of these components and chain them together in a machine learning workflow that can be reused by re-executing the end-to-end workflow whenever the data gets updated. As shown in figure 5.6, new videos are regularly added to the original YouTube-8M dataset, and the workflow is executed every time the dataset is updated. The next step, model training, trains the entity-tagging model using the most recent dataset. Then, the last step, model serving, uses the trained model to tag entities in unseen videos.

Figure 5.6 New videos are regularly added to the original YouTube-8M dataset, and the workflow is executed every time the dataset is updated.

Now, let’s take a look at a more complex real-world scenario. Let’s assume we already know how to implement the model training step for any given machine learning model architecture.
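
The fan-out and fan-in structure used throughout the rest of this chapter—one data ingestion step fanning out to several model training steps, whose results fan back in to a model selection step that keeps the top models—can be sketched in a few lines of Python. This sketch is purely illustrative: the architectures, accuracies, and training function are hypothetical placeholders.

```python
# Fan-out: one ingestion step feeds several training steps that run
# concurrently. Fan-in: a selection step waits for all of them and
# keeps the top two models for model serving.
from concurrent.futures import ThreadPoolExecutor

def ingest_data():
    return {"videos": ["video-1", "video-2", "video-3"]}

def train_model(architecture, dataset):
    # Placeholder: pretend each architecture reaches a different accuracy.
    fake_accuracy = {"linear": 0.71, "tree": 0.78, "cnn": 0.85}[architecture]
    return {"architecture": architecture, "accuracy": fake_accuracy}

dataset = ingest_data()
architectures = ["linear", "tree", "cnn"]

# Fan-out: launch one training step per architecture after ingestion.
with ThreadPoolExecutor() as pool:
    trained = list(pool.map(lambda a: train_model(a, dataset), architectures))

# Fan-in: the model selection step waits for every training step and
# keeps the top two models for model serving.
top_two = sorted(trained, key=lambda m: m["accuracy"], reverse=True)[:2]
print(top_two)
```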

5.3 Synchronous and asynchronous patterns: Accelerating workflows with concurrency

Each model training step in the system takes a long time to complete; however, the durations may vary across different model architectures or model parameters. Imagine an extreme case where one of the model training steps takes two weeks to complete because it is training a complex machine learning model that requires a huge amount of computational resources, while all the other model training steps take only one week to complete. In the machine learning workflow we built earlier with the fan-in and fan-out patterns, many of the steps, such as model selection and model serving, will have to wait an additional week until this long-running model training step is completed. A diagram that illustrates the duration differences among the three model training steps is shown in figure 5.14.

Figure 5.14 A workflow that illustrates the duration differences for the three model training steps

In this case, since the model selection step and the steps following it require all model training steps to finish, the model training step that takes two weeks to complete will slow down the workflow by an entire week. We would rather use that additional week to re-execute all the model training steps that take one week to complete instead of wasting time waiting for one step!

5.3.1 The problem

We want to build a machine learning workflow that trains different models and then selects the top two models to use for model serving, which generates predictions based on the knowledge of both models. Because the completion time varies for each model training step in the existing machine learning workflow, the start of the following steps, such as the model selection step and the model serving step, depends on the completion of the previous steps.

However, a problem occurs when at least one of the model training steps takes much longer to complete than the remaining steps because the model selection step that follows can only start after this long model training step has completed. As a result, the entire workflow is delayed by this particularly long-running step. Is there a way to accelerate this workflow so it will not be affected by the duration of individual steps?

5.3.2 The solution

We want to build the same machine learning workflow as we did previously, which would train different models after the system has ingested data from the data source, select the top two models, and then use these two models to provide model serving that generates predictions using knowledge from both models.

However, this time we noticed a performance bottleneck because the start of each following step, such as model selection and model serving, depends on the completion of its previous steps. In our case, we have one long-running model training step that must complete before we can proceed to the next step.

What if we excluded the long-running model training step completely? Once we do that, the remaining model training steps will have consistent completion times, so the rest of the workflow can execute without waiting on any particular step that’s still running. A diagram of the updated workflow is shown in figure 5.15.

This naive approach may resolve our problem of extra waiting time for long-running steps. However, our original goal was to use this type of complex workflow to experiment with different machine learning model architectures and different sets of hyperparameters of those models to select the best-performing models to use for model serving. If we simply exclude the long-running model training step, we are essentially throwing away the opportunity to experiment with advanced models that may better capture the entities in the videos.

Is there a better way to speed up the workflow so that it will not be affected by the duration of this individual step? Let’s focus on the model training steps that only take one week to complete. What can we do when those short-running model training steps are complete?

Figure 5.15 The new workflow after the long-running model training step has been removed

When a model training step finishes, we have successfully obtained a trained machine learning model. In fact, we can use this trained model in our model serving system without waiting for the rest of the model training steps to complete. As a result, the users can see the results of tagged entities from their model serving requests that contain videos as soon as we have trained one model from one of the steps in the workflow. A diagram of this workflow is shown in figure 5.16.

After a second model training step finishes, we can then pass the two trained models directly to model serving. The aggregated inference results are presented to users instead of the results from only the model we obtained initially, as shown in figure 5.17.

Figure 5.16 A workflow where the trained model from a short-running model training step is applied directly to our model serving system without waiting for the remaining model training steps to complete

Figure 5.17 After a second model training step finishes, we pass the two trained models directly to model serving. The aggregated inference results are presented to users instead of only the results from the model that we obtained initially.

Note that while we can continue to use the trained models for model selection and model serving, the long-running model training step is still running. In other words, the steps are executed asynchronously—they don’t depend on each other’s completion. The workflow starts executing the next step before the previous step finishes.

Sequential steps are performed one at a time, and only when one has completed does the following step become unblocked. In other words, you must wait for a step to finish to move to the next one. For example, the data ingestion step must be completed before we start any of the model training steps.

Contrary to sequential steps, synchronous steps can start running at the same time once their dependencies are met. For example, the model training steps can run concurrently as soon as the preceding data ingestion step has finished; one model training step does not have to wait for another to start. The synchronous pattern is typically useful when you have multiple similar workloads that can run concurrently and finish at nearly the same time.

By incorporating these patterns, the entire workflow is no longer blocked by the long-running model training step. Instead, it can continue using the already-trained models from the short-running model training steps in the model serving system, which can start handling users’ model serving requests.
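
A minimal Python sketch of the two patterns, using threads and placeholder durations (seconds standing in for weeks), might look like the following. The step names and timings are hypothetical; the point is that the training steps start together (synchronous) and each trained model is handed to model serving as soon as its own step finishes (asynchronous), so the slowest step no longer blocks the others.

```python
# The three model training steps start at the same time once their
# dependency (data ingestion) is met; each trained model is deployed to
# model serving as soon as its own step completes.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def train_model(name, seconds):
    time.sleep(seconds)  # stand-in for one or two weeks of training
    return {"model": name}

def deploy_to_serving(model):
    print(f"serving now includes {model['model']}")

training_steps = {"simple-a": 1, "simple-b": 1, "complex": 3}

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(train_model, name, secs)
               for name, secs in training_steps.items()]
    # Asynchronous: act on each result as it arrives instead of waiting
    # for the slowest training step to finish.
    for future in as_completed(futures):
        deploy_to_serving(future.result())
```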

The synchronous and asynchronous patterns are also extremely useful in other distributed systems to optimize system performance and maximize the use of existing computational resources—especially when the amount of computational resources for heavy workloads is limited. We’ll apply this pattern in section 9.4.1.

5.3.3 Discussion

By mixing synchronous and asynchronous patterns, we can create more efficient machine learning workflows and avoid delays caused by steps that prevent others from executing, such as a long-running model training step. However, the models trained from the short-running model training steps may not be very accurate. That is, the models with simpler architectures may not discover as many entities in the videos as the more complex model from the long-running model training step (figure 5.18).

Figure 5.18 The models trained from the two finished short-running model training steps are very simple and serve as a baseline. They can only identify a small number of entities, whereas the model trained from the most time-consuming step can identify many more entities.

As a result, we should keep in mind that the models we get early on may not be the best and may only be able to tag a small number of entities, which may not be satisfactory to our users.

When we deploy this end-to-end workflow in real-world applications, we need to consider whether it is more important for users to see inference results faster or to see better results. If the goal is to let users see inference results as soon as a new model is available, those results may not be what users were expecting. However, if users can tolerate a certain period of delay, it’s better to wait for more model training steps to finish. Then, we can be selective about the models we’ve trained and pick the best-performing models that provide very good entity-tagging results. Whether a delay is acceptable is subject to the requirements of real-world applications.

By using synchronous and asynchronous patterns, we can organize the steps in machine learning workflows from structural and computational perspectives. As a result, data science teams can spend less time waiting for workflows to complete, which maximizes performance and reduces infrastructure costs and idle computational resources. In the next section, we’ll introduce another pattern used very often in real-world systems that can save even more computational resources and make workflows run even faster.

5.3.4 Exercises

1. What causes each of the model training steps to start?
2. Are the steps blocking each other if they are running asynchronously?
3. What do we need to consider when deciding whether we want to use any available trained model as early as possible?

5.4 Step memoization pattern: Skipping redundant workloads via memoized steps

With the fan-in and fan-out patterns in the workflow, the system can execute complex workflows that train multiple machine learning models and pick the most performant models to provide good entity-tagging results in the model serving system. The workflows we’ve seen in this chapter contain only a single data ingestion step. In other words, the data ingestion step in these workflows always executes first, before the remaining steps, such as model training and model serving, can begin.

Unfortunately, in real-world machine learning applications, the dataset does not always remain unchanged. Now, imagine that new YouTube videos are becoming available and are being added to the YouTube-8M dataset every week. Following our existing workflow architecture, if we would like to retrain the model so that it accounts for the additional videos that arrive on a regular basis, we need to run the entire workflow regularly from scratch—from the data ingestion step to the model serving step—as shown in figure 5.19.

Say the dataset does not change, but we want to experiment with new model architectures or new sets of hyperparameters, which is very common for machine learning practitioners (figure 5.20). For example, we may change the model architecture from simple linear models to more complex models such as tree-based models or convolutional neural networks. We can also stick with the particular model architecture we’ve used and only change the set of model hyperparameters, such as the number of layers and hidden units in each of those layers for neural network models or the maximum depth of each tree for tree-based models. For cases like these, we still need to run the end-to-end workflow, which includes the data ingestion step to re-ingest the data from the original data source from scratch. Performing data ingestion again is very time-consuming.

Figure 5.19 A diagram of the entire workflow that is re-executed every time the dataset is updated

Figure 5.20 A diagram where the entire workflow is re-executed every time we experiment with a new model type or hyperparameter even though the dataset has not changed

5.4.1 The problem

Machine learning workflows usually start with a data ingestion step. If the dataset is being updated regularly, we may want to rerun the entire workflow to train a fresh machine learning model that takes the new data into account. To do so, we need to execute the data ingestion step every time. Alternatively, if the dataset is not updated, but we want to experiment with new models, we still need to execute the entire workflow, including the data ingestion step. However, the data ingestion step can take a long time to complete, depending on the size of the dataset. Is there a way to make this workflow more efficient?

5.4.2 The solution

Given how time-consuming the data ingestion step usually is, we probably don’t want to re-execute it every time the workflow runs to retrain or update our entity-tagging models. Let’s first think about the root cause of this problem. The dataset of YouTube videos is updated regularly, and the new data is persisted to the data source on a regular basis (e.g., once a month).

We have two use cases in which we need to re-execute the entire machine learning workflow:

- After the dataset has been updated, we rerun the workflow to train a new model that uses the updated dataset.
- We want to experiment with a new model architecture using the dataset that’s already been ingested, even though it may not have been updated yet.

The fundamental problem is the time-consuming data ingestion step. With the current workflow architecture, the data ingestion step will need to be executed regardless of whether the dataset has been updated.

Ideally, if the dataset has not been updated, we don’t want to re-ingest the data that’s already been collected. In other words, we would like to execute the data ingestion step only when we know that the dataset has been updated, as shown in figure 5.21.

Now the challenge comes down to determining whether the dataset has been updated. Once we have a way to identify that, we can conditionally reconstruct the machine learning workflow and control whether we want to include a data ingestion step to be re-executed (figure 5.21).

Figure 5.21 A diagram where the data ingestion step is skipped when the dataset has not been updated

One way to identify whether the dataset has been updated is through the use of a cache. Since our dataset is updated regularly on a fixed schedule (e.g., once a month), we can create a time-based cache that stores the location of the ingested and cleaned dataset (assuming the dataset is located in a remote database) and the timestamp of its last update. The data ingestion step in the workflow is then constructed and executed dynamically based on whether the last-updated timestamp falls within a particular window. For example, if the time window is set to two weeks, we consider the ingested data fresh if it has been updated within the past two weeks. In that case, the data ingestion step is skipped, and the following model training steps use the already-ingested dataset from the location stored in the cache.
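
Here is a minimal Python sketch of such a time-based cache. The cache file, the dataset location, and the ingestion function are hypothetical placeholders; only the freshness check reflects the logic described above.

```python
# A time-based cache for step memoization: skip data ingestion if the
# cached dataset was updated within the freshness window.
import json
import time
from pathlib import Path

CACHE_FILE = Path("ingestion_cache.json")        # hypothetical cache location
FRESHNESS_WINDOW = 14 * 24 * 3600                # two weeks, in seconds

def ingest_data():
    # Placeholder for the expensive ingestion step.
    return "s3://bucket/youtube-8m/ingested"     # hypothetical dataset location

def get_ingested_dataset():
    if CACHE_FILE.exists():
        cache = json.loads(CACHE_FILE.read_text())
        if time.time() - cache["last_updated"] < FRESHNESS_WINDOW:
            # Cache hit: reuse the already-ingested dataset.
            return cache["dataset_location"]
    # Cache miss or stale cache: re-run ingestion and refresh the cache.
    location = ingest_data()
    CACHE_FILE.write_text(json.dumps(
        {"dataset_location": location, "last_updated": time.time()}))
    return location

dataset_location = get_ingested_dataset()
```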

Figure 5.22 illustrates the case where a workflow has been triggered, and we check whether the data has been updated within the last two weeks by accessing the cache. If the data is fresh, we skip the execution of the unnecessary data ingestion step and execute the model training step directly.

Figure 5.22 The workflow has been triggered, and we check whether the data has been updated within the last two weeks by accessing the cache. If the data is fresh, we skip the execution of the unnecessary data ingestion step and execute the model training step directly.

The time window controls how old the cache can be while we still consider the dataset fresh enough to be used directly for model training instead of re-ingesting the data from scratch.

Alternatively, we can store some important metadata about the data source in the cache, such as the number of records currently available in the original data source. This type of cache is called a content-based cache since it stores information extracted from a particular step, such as its input and output information. With this type of cache, we can identify whether the data source has changed significantly (e.g., whether the number of original records in the data source has doubled). If there’s a significant change, it’s usually a signal to re-execute the data ingestion step since the current dataset is old and outdated. A workflow that illustrates this approach is shown in figure 5.23.
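
A content-based check can be sketched just as briefly. In this hypothetical example, the cache holds the record count observed at the last ingestion, and ingestion is re-executed only when the current count differs from it by more than a chosen threshold.

```python
# A content-based cache check: re-ingest only when the record count in the
# data source has changed significantly since the last ingestion.
def count_records_in_source():
    return 1_000_000  # placeholder for a query against the data source

def needs_reingestion(cached_record_count, threshold=0.5):
    current = count_records_in_source()
    change = abs(current - cached_record_count) / max(cached_record_count, 1)
    return change > threshold

print(needs_reingestion(cached_record_count=900_000))  # False: skip ingestion
print(needs_reingestion(cached_record_count=400_000))  # True: re-ingest
```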

Figure 5.23 The workflow has been triggered, and we check whether the metadata collected from the dataset, such as the number of records in the dataset, has changed significantly. If the change is not significant, we skip the execution of the unnecessary data ingestion step and execute the model training step directly.

This pattern, which uses a cache to determine whether a step should be executed or skipped, is called step memoization. With the help of step memoization, a workflow can identify steps with redundant workloads that can be skipped instead of being re-executed and thus greatly accelerate the execution of the end-to-end workflow. We’ll apply this pattern in section 9.4.2.

5.4.3 Discussion

In real-world machine learning applications, many workloads besides data ingestion are computationally heavy and time-consuming. For example, the model training step uses a lot of computational resources to achieve high-performance model training and can sometimes take weeks to complete. If we are only experimenting with other components that do not require updating the trained model, it might make sense to avoid re-executing the expensive model training step. The step memoization pattern comes in handy when deciding whether you can skip heavy and redundant steps.

If we are creating content-based caches, the decision about the type of information to extract and store in the cache may not be trivial. For example, if we are trying to cache the results from a model training step, we may want to consider using the trained model artifact that includes information such as the type of machine learning model and the set of hyperparameters of the model. When the workflow is executed again, it will decide whether to re-execute the model training step based on whether we are trying the same model. Alternatively, we may store information like the performance statistics (e.g., accuracy, mean-squared error, etc.) to identify whether it’s beyond a threshold and worth training a more performant model.
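
For instance, a content-based cache key for a model training step might be derived from the model type and its hyperparameters, so the step is re-executed only when we are actually trying a different model. The following sketch is hypothetical; the hyperparameter names are placeholders.

```python
# Derive a cache key for a model training step from its model type and
# hyperparameters; identical configurations map to the same key.
import hashlib
import json

def training_cache_key(model_type, hyperparameters):
    payload = json.dumps({"model": model_type, "params": hyperparameters},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

key = training_cache_key("cnn", {"layers": 4, "hidden_units": 256})
# If a cached model artifact already exists under this key, the training
# step can be skipped and the artifact reused.
print(key)
```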

Furthermore, when applying the step memoization pattern in practice, be aware that it requires a certain level of maintenance effort to manage the life cycle of the created caches. For example, if 1,000 machine learning workflows run every day, with an average of 100 memoized steps per workflow, 100,000 caches will be created every day. Depending on the type of information they store, these caches require a certain amount of space that can accumulate rather quickly.

To apply this pattern at scale, a garbage collection mechanism must be in place to delete unnecessary caches automatically and prevent the accumulated caches from taking up a huge amount of disk space. For example, one simple strategy is to record the timestamp when a cache was last hit and used by a step in a workflow and then scan the existing caches periodically to clean up those that have not been used or hit for a long time.
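
That strategy can be sketched as follows. The in-memory cache store and idle threshold are purely illustrative assumptions; a real implementation would delete files or database entries instead.

```python
# Periodically sweep cache entries that have not been hit for a long time.
import time

MAX_IDLE_SECONDS = 30 * 24 * 3600  # delete caches unused for 30 days (assumed)

caches = {
    "ingestion-cache": {"last_hit": time.time()},                      # recently used
    "old-training-cache": {"last_hit": time.time() - 90 * 24 * 3600},  # stale
}

def sweep(caches, now=None):
    now = now or time.time()
    stale = [key for key, entry in caches.items()
             if now - entry["last_hit"] > MAX_IDLE_SECONDS]
    for key in stale:
        del caches[key]  # in practice this would also free the disk space
    return stale

print(sweep(caches))  # ['old-training-cache']
```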

5.4.4 Exercises

1. What types of steps benefit most from step memoization?
2. How do we tell whether a step’s execution can be skipped when its workflow has been triggered to run again?
3. What do we need to manage and maintain once we’ve applied the pattern at scale?

5.5 Answers to exercises

Section 5.2

1. No, because we have no guarantee of the order in which concurrent copies of those steps will run.
2. Training an ensemble model depends on the completion of the other model training steps for the sub-models. We cannot use the fan-in pattern because the ensemble model training step would need to wait for the other model training steps to complete before it could start running, which would require extra waiting and delay the entire workflow.

Section 5.3

1. Due to the variation in completion times for each model training step in the existing machine learning workflow, the start of each following step, such as model selection and model serving, depends on the completion of the previous step.
2. No, asynchronous steps won’t block each other.
3. We need to consider, from the user’s perspective, whether we want to use any available trained model as early as possible. We should think about whether it’s more important for users to see inference results faster or to see better results. If the goal is to allow users to see the inference results as soon as a new model is available, those results may not be good enough or what users are expecting. Alternatively, if certain delays are acceptable to users, waiting for more model training steps to finish is preferable. You can then be selective about the trained models and pick the best-performing models that will provide very good entity-tagging results.

Section 5.4

1. Steps that are time-consuming or require a huge amount of computational resources.
2. We can use the information stored in the cache, such as when the cache is initially created or metadata collected from the step, to decide whether we should skip the execution of a particular step.
3. We need to set up a garbage collection mechanism to recycle and delete the created caches automatically.

Summary

- Workflow is an essential component in machine learning systems as it connects all other components in a machine learning system. A machine learning workflow can be as easy as chaining data ingestion, model training, and model serving.
- The fan-in and fan-out patterns can be incorporated into complex workflows to make them maintainable and composable.
- The synchronous and asynchronous patterns accelerate the machine learning workloads with the help of concurrency.
- The step memoization pattern improves the performance of workflows by skipping duplicate workloads.