Operation patterns

This chapter covers
- Recognizing areas of improvement in machine learning systems, such as job scheduling and metadata
- Preventing resource starvation and avoiding deadlocks using scheduling techniques, such as fair-share scheduling, priority scheduling, and gang scheduling
- Handling failures more effectively to reduce any negative effect on users via the metadata pattern
In chapter 5, we focused on machine learning workflows and the challenges of building them in practice. The workflow is an essential part of a machine learning system, as it connects all of the system's other components. A machine learning workflow can be as simple as chaining data ingestion, model training, and model serving. It can also become very complex when handling real-world scenarios, requiring additional steps and performance optimizations to be part of the overall workflow.
Knowing the tradeoffs we may encounter when making design decisions to meet specific business and performance requirements is essential. I previously introduced a few established patterns commonly adopted in industry. Each pattern can be reused to build simple to complex machine learning workflows that are efficient and scalable. For example, we learned how to use the fan-in and fan-out patterns to build a system to execute complex machine learning workflows (section 5.2). This system can train multiple machine learning models and pick the most performant ones to provide good entity-tagging results. We also used synchronous and asynchronous patterns to make machine learning workflows more efficient and avoid delays due to the long-running model training steps that block other steps (section 5.3).
Since real-world distributed machine learning workflows can be extremely complex, as we saw in chapter 5, a great deal of operational work is needed to maintain and manage the various components of these systems, such as improving system efficiency, observability, monitoring, and deployment. These operational efforts usually require a lot of communication and collaboration between the DevOps and data science teams. For instance, the DevOps team may not have enough domain knowledge of the machine learning algorithms used by the data science team to debug problems or to optimize the underlying infrastructure to accelerate the machine learning workflows. For a data science team, the type of computational workload varies, depending on the team structure and the way team members collaborate. As a result, there's no universal way for the DevOps team to handle the requests for different workloads from the data science team.
Fortunately, operational efforts and patterns can be used to greatly accelerate the end-to-end workflow. They can also reduce maintenance and communication efforts when engineering teams are collaborating with teams of data scientists or machine learning practitioners before the systems become production ready.
In this chapter, we’ll explore some of the challenges involved when performing operations on machine learning systems in practice and introduce a few commonly used patterns. For example, we’ll use scheduling techniques to prevent resource starvation and avoid deadlocks when many team members are working collaboratively in the same cluster with limited computational resources. We will also discuss the benefits of the metadata pattern, which can provide insights into the individual steps in machine learning workflows and help us handle failures more appropriately to reduce any negative effects on users.
6.1 What are operations in machine learning systems?

In this chapter, I will focus on operational techniques and patterns that appear across multiple components or steps of a machine learning workflow, rather than patterns that are specific to an individual component. For example, the workflow shown in figure 6.1 includes three failed steps among the model training steps that run after data ingestion and the model serving steps that follow them. Unfortunately, each step is like a black box, and we don't yet know many details about any of them. At this point, we only know whether they failed and whether the failures affected the following steps. As a result, such failures are hard to debug.
Figure 6.1 An example workflow where multiple model training steps occur after data ingestion and multiple model serving steps occur after the model training steps. Three steps failed, but looking at the workflow at this high level we cannot tell the root cause: perhaps a step failed to connect to the database, or the workers for model training ran out of memory.
The operation patterns I introduce in this chapter can increase the visibility of the entire workflow to help us understand the root cause of the failures and give us some ideas on how to handle the failures properly. In addition, the increased observability may help us develop improvements in system efficiency that are beneficial to future executions of similar workflows.
What about MLOps? We often hear about MLOps nowadays, a term derived from machine learning and operations. It usually refers to a collection of practices, drawn from machine learning and DevOps, for managing machine learning lifecycles and for deploying and managing machine learning models in production efficiently and reliably.
MLOps usually requires communication and collaboration between the DevOps and data science teams. It focuses on improving the quality of production machine learning and embracing automation while maintaining business requirements. The scope of MLOps can be extremely large and varies depending on the context.
Given how large the scope of MLOps can be, I will focus on only a selected set of patterns that are mature at the time of writing. You can expect updates in future versions of this chapter as this field evolves.
6.2 Scheduling patterns: Assigning resources effectively in a shared cluster
Let’s assume we have successfully set up the distributed infrastructure for users to submit distributed model training jobs that are scheduled to run on multiple CPUs by a default scheduler. A scheduler is responsible for assigning computational resources to perform tasks requested by the system. It is designed to keep computational resources busy and allow multiple users to collaborate with shared resources more easily. Multiple users are trying to build models using the shared computational resources in the cluster for different scenarios. For example, one user is working on a fraud detection model that tries to identify fraudulent financial behaviors such as international money laundering. Another user is working on a condition monitoring model that can generate a health score to represent the current condition for industrial assets such as components on trains, airplanes, wind turbines, etc.
Our initial infrastructure provides only a simple scheduler, which schedules jobs on a first-come, first-served basis, as shown in figure 6.2. For example, the third job is scheduled only after the second job has been scheduled, and each job's computational resources are allocated when the job is scheduled.
Figure 6.2 A diagram of an infrastructure that only provides a simple scheduler, which schedules jobs on a first-come, first-served basis
In other words, the users who schedule jobs later must wait for all previously submitted jobs to finish before their model training jobs can start executing. Unfortunately, in the real world, users often want to submit multiple model training jobs to experiment with different sets of models or hyperparameters. These jobs block other users' model training jobs from executing, since the previously submitted experiments are already using all of the available computational resources.
In this case, users must compete for resources (e.g., waking up in the middle of the night to submit model training jobs when fewer users are using the system). As a result, collaboration among team members may not be pleasant. Some jobs include training very large machine learning models, which usually consume a lot of computational resources and thus increase the time other users have to wait for their jobs to execute.
In addition, if we schedule only some of the requested workers for a distributed model training job, training cannot start until all of the requested workers are ready, because the distribution strategy relies on the collective communication pattern. If the necessary computational resources never become available, the job will never start, and the resources already allocated to the existing workers will be wasted.
6.2.1 The problem

We have set up a distributed infrastructure for users to submit distributed model training jobs, which are scheduled to run by a default scheduler responsible for assigning computational resources to perform the tasks requested by the users. However, the default scheduler schedules jobs only on a first-come, first-served basis. As a result, when multiple users attempt to use the cluster, they often need to wait a long time for computational resources to become available; that is, until the previously submitted jobs have completed. In addition, distributed model training jobs cannot begin to execute until all of the requested workers are ready, due to the nature of the distributed training strategy, such as a collective communication strategy. Are there any alternatives to the existing default scheduler that would let us assign computational resources more effectively in a shared cluster?
6.2.2 The solution

In our scenario, the problem starts to occur when multiple users try to use the system to submit distributed model training jobs at the same time. Since the jobs are executed on a first-come, first-served basis, the waiting times for jobs submitted later are long, even when those jobs come from different users.
It's easy to identify different users, so an intuitive solution would be to limit how much of the total computational resources each user is allotted. For example, say there are four users (A, B, C, and D). Once user A submits a job that uses 25% of the total available CPU cycles (https://techterms.com/definition/clockcycle), they cannot submit another job until those allocated resources are released and ready to be allocated to new jobs. Other users can submit jobs independently of how many resources user A is using. For example, if user B starts two processes that use the same amount of resources, those processes will each be allotted 12.5% of the total CPU cycles, giving user B 25% of the total resources. Each of the other users still receives 25% of the total cycles. Figure 6.3 illustrates the resource allocations for these four users.
If a new user E starts a process on the system, the scheduler will reapportion the available CPU cycles so that each user gets 20% of the whole (100% / 5 = 20%). The way we schedule our workloads to execute in our cluster in figure 6.3 is called fair-share scheduling. It is a scheduling algorithm for computer operating systems in which the CPU usage is equally distributed among system users or groups, as opposed to equal distribution among processes.
Figure 6.3 The resource allocations for the four users (A, B, C, and D). User B's 25% share is split between two processes, and each user's allocation is independent of how many resources the other users are using.
So far, we have only discussed partitioning resources among users. When multiple teams are using the system to train their machine learning models and each team has multiple members, we can partition users into different groups and then apply the fair-share scheduling algorithm to both the users and the groups. Specifically, we first divide the available CPU cycles among the groups and then divide them further among the users within each group. For example, if three groups contain three, two, and four users, respectively, each group will be able to use 33.3% (100% / 3) of the total available CPU cycles. We can then calculate the available CPU cycles for each user in each group as follows:

- Group 1: 33.3% / 3 users = 11.1% per user
- Group 2: 33.3% / 2 users = 16.7% per user
- Group 3: 33.3% / 4 users = 8.3% per user

Figure 6.4 summarizes the resource allocation we calculated for each individual user in the three groups.
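To make the arithmetic concrete, here is a minimal Python sketch of the two-level fair-share calculation. The group and user names are made up for illustration, and a real scheduler would track live usage rather than compute static shares.

# A minimal sketch of two-level fair-share accounting: the total CPU capacity
# is split evenly across groups first, and each group's share is then split
# evenly across its users. The group sizes match the example above.

def fair_share_per_user(groups, total=100.0):
    """Return the percentage of total CPU cycles each user is entitled to."""
    shares = {}
    group_share = total / len(groups)            # every group gets an equal slice
    for members in groups.values():
        user_share = group_share / len(members)  # split the slice evenly per user
        for user in members:
            shares[user] = user_share
    return shares

groups = {
    "group1": ["user1", "user2", "user3"],
    "group2": ["user4", "user5"],
    "group3": ["user6", "user7", "user8", "user9"],
}

for user, share in fair_share_per_user(groups).items():
    print(f"{user}: {share:.1f}% of the total CPU cycles")
# group1 users get ~11.1%, group2 users ~16.7%, and group3 users ~8.3%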
Fair-share scheduling would help us resolve the problem of multiple users running distributed training jobs concurrently. We can apply this scheduling strategy at each level of abstraction, such as processes, users, groups, etc. All users have their own pool of available resources without interfering with each other.
However, in some situations, certain jobs should be executed earlier. For example, a cluster administrator would like to submit jobs for cluster maintenance, such as deleting jobs that have been stuck and taking up resources for a long time. Executing these cluster maintenance jobs earlier would help make more computational resources available and thus unblock others from submitting new jobs.
Figure 6.4 A summary of the resource allocation for each user in three groups. Each group receives the same share of resources, and each user within a group receives an equal percentage of that group's share.
Let’s assume the cluster administrator is user 1 in group 1. Two other nonadmin users are also in group 1, as in the previous example. User 2 is running job 1, which is using all of the 11.1% of the CPU cycles allocated to them based on the fair-share scheduling algorithm. Even though user 2 has enough computational power to perform job 1, the job depends on the success of job 2 from user 3. For example, job 2 from user 3 produces a table in the database that job 1 needs to perform a distributed model training task. Figure 6.5 summarizes the resource allocations and usages for each user in the first group.
Unfortunately, job 2 is stuck due to an unstable database connection and keeps trying to reconnect to produce the data that job 1 needs. To fix the problem, the administrator needs to submit job 3 that kills and then restarts the stuck job 2.
Now assume that the admin (user 1) is already using the 11.1% of the total CPU cycles available to them. Since maintenance job 3 is submitted later than all of the previous jobs, it is added to the job queue and waits to be executed when resources are released, based on the first-come, first-served nature of our fair-share scheduling algorithm. As a result, we encounter a deadlock where no job can proceed, as illustrated in figure 6.6.
To fix this problem, we can allow users to assign priorities to each of the jobs so that jobs with higher priority are executed earlier, in contrast to the first-come, first-served nature of the fair-share scheduling algorithm. In addition, the jobs that are already running can be preempted or evicted to make room for jobs with higher priorities if not enough computational resources are available. This way of scheduling jobs based on priorities is called priority scheduling.
Figure 6.5 A summary of resource allocations and usages for each user in the first group. Job 1 depends on a table that job 2 produces in the database.
Figure 6.6 The admin user (user 1) in group 1 is trying to schedule a job (job 3) to restart the stuck job (job 2) but encounters a deadlock where no job can proceed. User 1 is already using their 11.1% of the total CPU cycles, so the new job 3 is queued. Job 1 depends on a table that job 2 produces in the database, and job 2 is stuck due to an unstable database connection, repeatedly trying to reconnect so it can produce the data that job 1 needs.
Say, for example, four jobs (A, B, C, and D) have been submitted concurrently, and each job has been marked with a priority by the user. Jobs A and C are high priority, whereas job B is low priority and job D is medium priority. With priority scheduling, jobs A and C are executed first since they have the highest priorities, followed by job D with medium priority and, finally, low-priority job B. Figure 6.7 illustrates the order of execution for the four jobs (A, B, C, and D) when priority scheduling is used.
Figure 6.7 The order of execution for the four concurrently submitted jobs (A, B, C, and D) when priority scheduling is used
Let’s consider another example. Assume three jobs (B, C, and D) with different priorities are submitted concurrently and are executed based on their priorities, similar to the previous example. If another job (job A) with high priority is submitted after job B, which is low priority, has already started running, job B will be preempted, and then job A will start. The computational resources previously allocated to job B will be released and taken over by job A. Figure 6.8 summarizes the order of execution for the four jobs (A, B, C, and D) where the low-priority job B already running is preempted by a new job (job A) with higher priority.
With priority scheduling, we can effectively eliminate the problem we encountered earlier, where jobs could only be executed sequentially on a first-come, first-served basis. Jobs can now be preempted in favor of tasks with higher priorities.
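As a rough illustration of priority scheduling with preemption, the following Python sketch models a cluster that runs one job at a time; the class, the priority levels, and the job names are hypothetical and only mirror the examples above, not any real scheduler's API.

import heapq

# A simplified sketch of priority scheduling with preemption on a cluster that
# runs one job at a time. Priorities are negated so the max-priority job pops
# first; the submission counter breaks ties so equal-priority jobs run on a
# first-come, first-served basis.

PRIORITY = {"high": 3, "medium": 2, "low": 1}

class PriorityScheduler:
    def __init__(self):
        self._queue = []     # entries are (-priority, submission_order, job_name)
        self._order = 0
        self.running = None  # (job_name, priority_name) or None

    def submit(self, name, priority):
        heapq.heappush(self._queue, (-PRIORITY[priority], self._order, name))
        self._order += 1
        # Preempt the running job if the newcomer has strictly higher priority.
        if self.running and PRIORITY[priority] > PRIORITY[self.running[1]]:
            preempted_name, preempted_priority = self.running
            print(f"preempting {preempted_name} to run {name}")
            heapq.heappush(
                self._queue, (-PRIORITY[preempted_priority], self._order, preempted_name)
            )
            self._order += 1
            self.running = None
        if self.running is None:
            self._run_next()

    def finish_running(self):
        self.running = None
        self._run_next()

    def _run_next(self):
        if self._queue:
            neg_priority, _, name = heapq.heappop(self._queue)
            priority_name = next(k for k, v in PRIORITY.items() if v == -neg_priority)
            self.running = (name, priority_name)
            print(f"running {name} ({priority_name} priority)")

scheduler = PriorityScheduler()
scheduler.submit("B", "low")    # the cluster is idle, so B starts right away
scheduler.submit("A", "high")   # A preempts the running low-priority job B
scheduler.finish_running()      # A completes, and the requeued job B resumes

Calling finish_running after job A completes pops the requeued job B, so the preempted low-priority job still runs eventually instead of being dropped.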
However, for distributed machine learning tasks—specifically, model training tasks—we want to ensure that all workers are ready before starting distributed training. Otherwise, the ones that are ready would be waiting for the remaining workers before the training can proceed, which wastes resources.
For example, in figure 6.9, three worker processes in the same process group are performing an allreduce operation. However, two workers are not ready because the underlying distributed cluster is experiencing an unstable network. As a result, two of the processes (processes 1 and 3) that depend on those affected communications would not receive some of the calculated gradient values (v0 and v2) on time (denoted by question marks in figure 6.9), and the entire allreduce operation is stuck until everything is received.
Figure 6.8 The order of execution for the four jobs (A, B, C, and D) where the running low-priority job is preempted by a new job with higher priority. Jobs B, C, and D are initially executed based on their priorities (C → D → B); job A (high priority) is submitted after job B (low priority) has already started running, so job B is preempted and job A starts.
Figure 6.9 An example of the allreduce process with an unstable network between the worker processes that blocks the entire model training process
Gang scheduling is usually used to run distributed model training tasks. It ensures that if two or more workers communicate with each other, they will be ready to do so at the same time. In other words, gang scheduling only schedules workers when enough workers are available and ready to communicate.
If they are not gang scheduled, one worker may wait to send or receive a message while the other worker is sleeping, and vice versa. When the workers are waiting for other workers to be ready for communication, we are wasting allocated resources on the workers that are ready, and the entire distributed model training task is stuck.
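The core admission rule can be sketched in a few lines of Python: start a distributed training job only if all of its requested workers can be placed at once, and otherwise keep it pending without allocating anything. The Job and Cluster classes below are illustrative assumptions, not a real scheduler API.

from dataclasses import dataclass

# A minimal sketch of the admission rule behind gang scheduling: a distributed
# training job is only started when ALL of its requested workers can be placed
# at once; otherwise it stays pending and no partial allocation is made.

@dataclass
class Job:
    name: str
    workers_requested: int
    cpus_per_worker: int

class Cluster:
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus

    def can_place_gang(self, job):
        return self.free_cpus >= job.workers_requested * job.cpus_per_worker

    def try_schedule(self, job):
        if not self.can_place_gang(job):
            # Without gang scheduling, we might start a few workers here and
            # leave them idle waiting for the rest; instead we admit nothing.
            print(f"{job.name}: pending (needs all {job.workers_requested} workers)")
            return False
        self.free_cpus -= job.workers_requested * job.cpus_per_worker
        print(f"{job.name}: all {job.workers_requested} workers started together")
        return True

cluster = Cluster(free_cpus=16)
cluster.try_schedule(Job("training-job-a", workers_requested=3, cpus_per_worker=4))  # fits
cluster.try_schedule(Job("training-job-b", workers_requested=2, cpus_per_worker=4))  # pending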
For example, for collective communication–based distributed model training tasks, all workers must be ready to communicate the calculated gradients and update the models on each worker to complete an allreduce operation. I assume that the machine learning framework does not support elastic scheduling yet, which we will discuss in the next section. As shown in figure 6.10, the gradients are all denoted by question marks since they have not yet arrived in any of those worker processes in the second worker group. All worker processes have not yet started sending the gradients, and they won’t until they all move to the ready state after the network stabilizes.
Figure 6.10 With gang scheduling, the worker processes will not start sending the gradients until they are all in the ready state after the network becomes stable.
With gang scheduling, we can make sure not to start any of the worker processes until all workers are ready, so none of them will be waiting for the remaining worker processes. As a result, we can avoid wasting computational resources. Once the network becomes stable, all of the gradients (v0, v1, and v2) arrive on each worker process after a successful allreduce operation, as shown in figure 6.11.
NOTE The details of different types of gang scheduling and their algorithms are out of the scope of this book and will not be discussed here. However, we will be using an existing open source framework to integrate gang scheduling into distributed training in the last part of the book.
Figure 6.11 All of the gradients arrive on each of the worker processes after a successful allreduce operation once the network is stable.
By incorporating different scheduling patterns, we can address various problems that arise when multiple users use the infrastructure to schedule different types of jobs. Although we looked at a few specific use cases for these scheduling patterns, they can be found in many systems that require careful management of computational resources, especially when resources are scarce. Many scheduling techniques are applied even at the operating-system level to make sure applications run efficiently and share resources reasonably.
6.2.3 Discussion

We've seen how fair-share scheduling can help us solve the problem of multiple users running distributed training jobs concurrently. Fair-share scheduling allows us to apply a scheduling strategy at each level of abstraction, such as processes, users, groups, etc. We also discussed priority scheduling, which can be used to eliminate the problem we encounter when jobs can only be executed sequentially on a first-come, first-served basis. Priority scheduling allows jobs to be executed based on their priority levels, preempting low-priority jobs to make room for high-priority jobs.
With priority scheduling, if a cluster is used by a large number of users, a malicious user could create jobs at the highest possible priority, causing other jobs to be evicted or never scheduled at all. To deal with this potential problem, administrators of real-world clusters usually enforce certain rules and limits to prevent users from creating a huge number of high-priority jobs.
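One possible guardrail looks like the following sketch, which caps the number of high-priority jobs each user may have at a time; the limit, the class, and the check are illustrative, and real clusters typically enforce such rules through the scheduler's own quota mechanisms rather than application code.

from collections import Counter

# A sketch of one possible guardrail: cap the number of high-priority jobs a
# single user may have queued or running at any time.

MAX_HIGH_PRIORITY_JOBS_PER_USER = 2

class AdmissionPolicy:
    def __init__(self):
        self._high_priority_counts = Counter()

    def admit(self, user, priority):
        if priority == "high":
            if self._high_priority_counts[user] >= MAX_HIGH_PRIORITY_JOBS_PER_USER:
                print(f"rejected: {user} already has "
                      f"{MAX_HIGH_PRIORITY_JOBS_PER_USER} high-priority jobs")
                return False
            self._high_priority_counts[user] += 1
        return True

policy = AdmissionPolicy()
print(policy.admit("alice", "high"))   # True
print(policy.admit("alice", "high"))   # True
print(policy.admit("alice", "high"))   # False: over the per-user cap
print(policy.admit("alice", "low"))    # True: low-priority jobs are not capped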
We also discussed gang scheduling, which ensures that if two or more workers need to communicate with each other, they will all be ready to communicate at the same time. Gang scheduling is especially helpful for collective communication–based distributed model training jobs, where all workers need to be ready to communicate the calculated gradients, to avoid wasting computational resources.
Some machine learning frameworks support elastic scheduling (see chapter 3), which allows distributed model training jobs to start with any number of workers available without waiting for all the requested workers to be ready. In this case, gang scheduling is not suitable because we would need to wait for all workers to be ready. Instead, we can begin making significant progress toward model training with elastic scheduling.
Because the number of workers may change during model training, the batch size (the sum of the mini-batch sizes across the workers) changes as well, which affects the model training accuracy. In that case, additional modifications to the model training strategy are needed. For example, we can use a customized learning rate scheduler that accounts for the current epoch or batch, or adjust the batch size dynamically based on the number of workers. Together with these algorithmic improvements, we can allocate and utilize existing computational resources more wisely and improve the user experience.
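As one example of such an adjustment, the sketch below applies the common linear-scaling heuristic: with a fixed per-worker mini-batch size, the learning rate is scaled in proportion to the effective batch size as workers join or leave. The base values are placeholders, not recommendations, and whether linear scaling is appropriate depends on the model.

# A rough sketch of adjusting the learning rate when elastic scheduling changes
# the number of workers mid-training. With a fixed per-worker mini-batch size,
# the effective (global) batch size grows with the worker count, and a common
# heuristic is to scale the learning rate linearly with it.

BASE_LEARNING_RATE = 0.01   # tuned for the base configuration below
BASE_NUM_WORKERS = 4
PER_WORKER_BATCH_SIZE = 32

def scaled_learning_rate(current_num_workers):
    """Linearly scale the learning rate with the effective batch size."""
    effective_batch = current_num_workers * PER_WORKER_BATCH_SIZE
    base_batch = BASE_NUM_WORKERS * PER_WORKER_BATCH_SIZE
    return BASE_LEARNING_RATE * effective_batch / base_batch

for workers in (4, 8, 2):   # workers join or leave during training
    print(f"{workers} workers -> learning rate {scaled_learning_rate(workers):.4f}")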
In practice, distributed model training jobs greatly benefit from scheduling patterns like gang scheduling. As a result, we can avoid wasting computational resources. However, one problem we might be neglecting is that any of these worker processes scheduled by gang scheduling may fail, leading to unexpected consequences. Often it’s hard to debug these types of failures. In the next section, I’ll introduce a pattern that will make debugging and handling failures easier.
6.2.4 Exercises

1 Can we only apply fair-share scheduling at the user level?
2 Is gang scheduling suitable for all distributed model training jobs?
6.3 Metadata pattern: Handle failures appropriately to minimize the negative effect on users

When building the most basic machine learning workflow, which includes only data ingestion, model training, and model serving, where each component appears only once as an individual step, everything seems pretty straightforward. Each step runs sequentially to completion. If any of these steps fails, we can pick up where it left off. For example, imagine the model training step has failed to read the ingested data (e.g., it lost the connection to the database where the ingested data is stored). We can retry the failed step and easily continue model training without rerunning the entire data ingestion process, as shown in figure 6.12.
However, when the workflow gets more complicated, any failures are not trivial to handle. For example, consider the workflow from chapter 5. This workflow trains models via three model training steps that arrive at different accuracies when tagging entities. Then, a model selection step picks the top two models with at least 90% accuracy trained from the first two model training steps, which will be used in the following two separate model serving steps. The results from the two model serving steps are then aggregated via a result aggregation step to present to users.
Figure 6.12 A baseline workflow where the model training step has failed to read the ingested data. We retry the failed step and pick up from there to continue model training without rerunning the entire data ingestion process.
Now let’s consider the case where the second and the third model training steps have both failed during execution (e.g., some of the workers allocated for model training are preempted). These two model training steps would have provided both the most and the least accurate model if they had finished successfully, as shown in figure 6.13.
At this point, one might think we should rerun both steps to proceed to the model selection and model serving steps. However, in practice, since we have already spent time training part of the models, we may not want to start everything from scratch. It would take much longer before our users could see the aggregated results from our best models. Is there a better way to handle these kinds of failures?
6.3.1 The problem

For complicated machine learning workflows, such as the one we discussed in chapter 5, where we want to train multiple models and then select the top-performing ones for model serving, deciding which strategy to use to handle failures of certain steps, given real-world requirements, is not always trivial. For example, when two out of three model training steps fail due to preempted workers, we don't want to start training those models from scratch, which would greatly increase the time needed to complete the workflow. How do we handle these failures appropriately so that the negative effect on users is minimized?
Figure 6.13 A machine learning workflow that trains models with different accuracies when tagging entities. The model selection step identifies the top two models with at least 90% accuracy to be used for model serving. The accuracies are crossed out in these two steps because the steps failed without arriving at the expected accuracies. The results from the two model serving steps are then aggregated to present to users.
6.3.2 The solution

Whenever we encounter a failure in a machine learning workflow, we should first understand the root cause (e.g., loss of network connections, lack of computational resources, etc.). Knowing the root cause is important because we need to understand the nature of the failure to predict whether retrying the failed steps would help. If the failures are due to long-lasting shortages that would very likely lead to repeated failures on retry, we could better utilize the computational resources to run other tasks. Figure 6.14 illustrates the difference in the effectiveness of retrying for permanent and temporary failures. When we retry the model training step after a permanent failure, the retries are ineffective and lead to repeated failures.
For example, in our case, we should first check whether the dependencies of a model training step are met, such as whether the ingested data from the previous step is still available. If the data has been persisted to a local disk or to a database, we can proceed to model training. However, if the data was held in memory and lost when the model training step failed, we cannot start model training without ingesting the data again. Figure 6.15 shows the process of restarting the data ingestion step when there's a permanent failure during model training.
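A minimal sketch of this decision might look like the following; the path and function names are placeholders for whatever your pipeline actually exposes.

import os

# A minimal sketch of the decision in figure 6.15: before retrying a failed
# model training step, check whether its dependency (the ingested data) still
# exists. If the data was only held in memory and is gone, retrying training
# alone would fail repeatedly, so we restart data ingestion first.

def handle_training_failure(ingested_data_path):
    if os.path.exists(ingested_data_path):
        # Temporary failure (e.g., a dropped connection): the data is still
        # persisted, so retrying just the training step can succeed.
        return "retry_model_training"
    # Permanent failure for this step: the ingested data is gone, so rerun
    # data ingestion before training.
    return "rerun_data_ingestion_then_training"

print(handle_training_failure("/data/ingested/train.tfrecord"))  # placeholder path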
Figure 6.14 The difference in the effectiveness of retrying for permanent and temporary failures
Figure 6.15 The process of restarting the data ingestion step when a permanent failure occurs during model training
Similarly, if the model training step fails due to preempted training workers or out-of-memory problems, we need to make sure we still have sufficient computational resources allocated before rerunning the model training step.
However, we won’t know what information to analyze to determine the root cause unless we intentionally record it as metadata during the runtime of each step in the entire machine learning workflow. For example, for each model training step, we can record metadata on the availability of the ingested data and whether different computational resources, such as memory and CPU usage, exceeded the limit before the step failed.
Figure 6.16 shows a workflow where the model training step failed. Metadata was collected every 5 minutes during the runtime of this step: the memory usage (in megabytes) and the availability of the training data (yes/no). We can see a sudden memory spike from 23 MB to 200 MB after 30 minutes. In this case, we can retry the step with an increased memory request, and it should then successfully produce a trained model to be used for the following model serving step.
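A simple way to capture this kind of metadata is to sample it on a fixed interval in a background thread while the step runs, as in the sketch below; the sampling helpers and the 5-minute interval are illustrative, and a production workflow engine would persist the samples to a metadata store rather than an in-memory list.

import threading
import time

# A sketch of collecting step-level metadata while a training step runs: a
# background thread samples memory usage and training-data availability on a
# fixed interval and appends the samples to a list kept with the workflow run.

def collect_metadata(get_memory_mb, data_available, interval_seconds, stop_event, records):
    while not stop_event.is_set():
        records.append({
            "timestamp": time.time(),
            "memory_mb": get_memory_mb(),                 # e.g., read from cgroup stats
            "training_data_available": data_available(),  # e.g., check the data path
        })
        stop_event.wait(interval_seconds)

records, stop_event = [], threading.Event()
collector = threading.Thread(
    target=collect_metadata,
    args=(lambda: 23.0, lambda: True, 300, stop_event, records),  # 300 s = 5 minutes
    daemon=True,
)
collector.start()
# ... run the model training step here ...
stop_event.set()
collector.join()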
Figure 6.16 An example workflow where the model training step failed, with the metadata collected showing an unexpected memory spike during runtime
In practice, for complex workflows like the one in figure 6.13, even when we know all the dependencies of the model training steps are met (e.g., we have enough computational resources and a good database connection to access the data source), we should also think about whether we want to handle the failures and how we'd like to handle them. We've already spent a lot of time on the training steps, but now the steps have suddenly failed, and we've lost all the progress. In other words, we don't want to start retraining all the models from scratch, which may add considerable time before we can deliver the aggregated results from our best models to users. Is there a better way to handle this without a huge effect on our user experience?
In addition to the metadata we’ve recorded for each of the model training steps, we could save more useful metadata that can be used to figure out whether it’s worth rerunning all the model training steps. For example, the model accuracy over time indicates whether the model is being trained effectively.
Model accuracy that remains steady or even decreases (from 30% to 27%, as shown in figure 6.17) may indicate that the model has already converged and that continuing training would no longer improve accuracy. In this example, even though two model training steps fail, it's not necessary to retry the third model training step from scratch, since it would lead to a model that converges fast but with low accuracy. Another potentially useful piece of metadata is the percentage of model training completed (e.g., if we've iterated through all of the requested batches and epochs, the completion is 100%).
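Putting those two signals together, a sketch of the retry decision might look like the following; the thresholds are illustrative and would need tuning for a real workflow.

# A sketch of using recorded metadata to decide whether a failed training step
# is worth rerunning from scratch. If the accuracy history had already
# plateaued (or was dropping), or the step was progressing too slowly relative
# to the time spent, the retry is unlikely to pay off.

def worth_retrying(accuracy_history, percent_complete, hours_elapsed,
                   min_improvement=0.01, min_progress_per_hour=5.0):
    if len(accuracy_history) >= 2:
        recent_improvement = accuracy_history[-1] - accuracy_history[-2]
        if recent_improvement < min_improvement:
            return False   # accuracy flat or decreasing: likely converged already
    if percent_complete / hours_elapsed < min_progress_per_hour:
        return False       # progressing too slowly to finish in a useful time
    return True

# The third training step in figure 6.17: accuracy fell from 30% to 27%.
print(worth_retrying([0.30, 0.27], percent_complete=40, hours_elapsed=2.0))  # False
# A step that was still improving and progressing steadily.
print(worth_retrying([0.55, 0.62], percent_complete=40, hours_elapsed=2.0))  # True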
Figure 6.17 An example workflow where two model training steps fail and one has decreasing model accuracy
Once we have this additional metadata about the model training steps, we can tell how well each model training step is progressing. For example, for the workflow in figure 6.18, we could conclude ahead of time that the third model training step was progressing very slowly (only 1% completion every 30 minutes) due to a smaller amount of allocated computational resources or a more complex model architecture. Given the limited time, it's highly likely we would end up with a model with low accuracy. As a result, we can disregard this model training step in favor of allocating more computational resources to the other model training steps with more potential, which leads to more accurate models faster.
Recording this metadata may help us derive insights specific to each of the failed steps in the end-to-end machine learning workflow. We can then decide on a strategy to handle the failed steps appropriately to avoid wasting computational resources and to minimize the effect on existing users. The metadata pattern provides great visibility into our machine learning pipelines. The recorded metadata can also be used to search, filter, and analyze the artifacts produced by each step in the future if we run many pipelines on a regular basis. For example, we might want to know which models are the most performant or which datasets contribute the most to those models, based on the historical training metrics.
Figure 6.18 An example workflow where two model training steps fail. One is disregarded because it is progressing very slowly, and the model will likely have low accuracy given the limited time.
6.3.3 Discussion

With the help of the metadata pattern, we can gain additional insights into the individual steps of machine learning workflows. Then, if any step fails, we can respond based on what's beneficial to our users and thus reduce any negative effects of the step failures.
One common type of metadata is the various network performance (http://mng.bz/D4lR) metrics while the model is being trained (e.g., bandwidth, throughput, latency). This type of information is very useful for detecting when certain workers experience poor network performance that blocks the entire training process. We can take down slow workers and start new workers to continue training, assuming the underlying machine learning frameworks support elastic scheduling and fault-tolerance (see chapter 3). For example, in figure 6.19, based on the metadata, the worker on the right-hand side has extremely high latency (10 times the latency of the other workers), which slows down the entire model training process. Ideally, this worker would be taken down and restarted.
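A sketch of how such metadata might be used to spot a straggler: flag any worker whose latency is far above the median of its peers so that it can be taken down and replaced. The threshold factor is illustrative, and the actual restart would go through whatever elastic-scheduling mechanism the training framework provides.

from statistics import median

# Flag workers whose recorded latency is far above the median of their peers,
# like the straggler in figure 6.19, so they can be taken down and replaced.

def find_slow_workers(latencies_ms, factor=5.0):
    baseline = median(latencies_ms.values())
    return [worker for worker, latency in latencies_ms.items()
            if latency > factor * baseline]

latencies_ms = {"worker-0": 12, "worker-1": 11, "worker-2": 13, "worker-3": 120}
print(find_slow_workers(latencies_ms))   # ['worker-3']: restart this worker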
Figure 6.19 An example parameter server–based model training where the worker on the right-hand side has extremely high latency (10 times the latency of the other workers), which slows down the entire model training process
An additional benefit of introducing the metadata pattern to our machine learning workflows is that the recorded metadata can be used to establish relationships between individual steps or across different workflows. For example, modern model management tools can use the recorded metadata to help users build the lineage of the trained models and visualize which individual steps and factors contributed to the model artifacts.
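As a toy illustration of how recorded metadata can support lineage, the sketch below walks backward from a model-serving artifact to the dataset that produced it; the record format is made up for illustration, and real model management tools store much richer metadata.

# A sketch of building a simple model lineage from recorded step metadata:
# each record names a step, its inputs, and its outputs, and we walk the
# records backward from an artifact to the datasets that produced it.

step_metadata = [
    {"step": "data_ingestion", "inputs": ["raw_dataset_v3"], "outputs": ["ingested_data_v3"]},
    {"step": "model_training_1", "inputs": ["ingested_data_v3"], "outputs": ["model_a"]},
    {"step": "model_serving_1", "inputs": ["model_a"], "outputs": ["endpoint_a"]},
]

def lineage(artifact, records):
    """Return the chain of artifacts and steps that led to the given artifact."""
    chain = [artifact]
    producers = {out: rec for rec in records for out in rec["outputs"]}
    while artifact in producers:
        rec = producers[artifact]
        chain.append(rec["step"])
        artifact = rec["inputs"][0]   # assume a single primary input for simplicity
        chain.append(artifact)
    return list(reversed(chain))

print(" -> ".join(lineage("endpoint_a", step_metadata)))
# raw_dataset_v3 -> data_ingestion -> ingested_data_v3 -> model_training_1
#   -> model_a -> model_serving_1 -> endpoint_a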
6.3.4 Exercises

1 If the training step failed due to the loss of training data source, what should we do?
2 What type of metadata can be collected if we look at individual workers or parameter servers?
6.4 Answers to exercises
Section 6.2

1 No, we can apply this scheduling strategy at each level of abstraction, such as processes, users, groups, etc.
2 No, some machine learning frameworks support elastic scheduling, which allows distributed model training jobs to start with any number of workers available without waiting for all the requested workers to be ready for communication. In this case, gang scheduling is not suitable.
Section 6.3

1 We should rerun data ingestion before retrying the model training step since this failure is permanent, and simply retrying would lead to repetitive failures.
2 Various network performance metrics while the model is being trained (e.g., bandwidth, throughput, and latency). This type of information is very useful when we want to detect when workers experience poor network performance that blocks the entire training process.
Summary

- There are different areas of improvement related to operations in machine learning systems, such as job scheduling and metadata.
- Various scheduling patterns, such as fair-share scheduling, priority scheduling, and gang scheduling, can be used to prevent resource starvation and avoid deadlocks.
- We can collect metadata to gain insights from machine learning workflows and handle failures more appropriately to reduce any negative effects on users.