Part 1 Basic concepts and background
This part of the book will provide some background and concepts related to distributed machine learning systems. We will start by discussing the growing scale of machine learning applications (given users’ demand for faster responses to meet real-life requirements), machine learning pipelines, and model architectures. Then we will talk about what a distributed system is, describe its complexity, and introduce one concrete example pattern that’s often used in distributed systems. In addition, we will discuss what distributed machine learning systems are, examine similar patterns that are often used in those systems, and talk about some real-life scenarios. At the end of this part, we will take a glance at what we’ll be learning in this book.
1 Introduction to distributed machine learning systems

This chapter covers
- Handling the growing scale in large-scale machine learning applications
- Establishing patterns to build scalable and reliable distributed systems
- Using patterns in distributed systems and building reusable patterns
Machine learning systems are becoming increasingly important. Recommendation systems learn to generate recommendations of potential interest, with the right context, according to user feedback and interactions; anomalous event detection systems help monitor assets to prevent downtime due to extreme conditions; and fraud detection systems protect financial institutions from security attacks and malicious fraud behaviors.
There is increasing demand for building large-scale distributed machine learning systems. If a data analyst, data scientist, or software engineer has basic knowledge of and hands-on experience in building machine learning models in Python and wants to take things a step further by learning how to build something more robust, scalable, and reliable, this book is the right one to read. Although experience in production environments or distributed systems is not a requirement, I expect readers in this position to have at least some exposure to machine learning applications running in production and to have written Python and Bash scripts for at least one year.
Being able to handle large-scale problems and take what’s developed on your laptop to large distributed clusters is exciting. This book introduces best practices in various patterns that help you speed up the development and deployment of machine learning models, use automations from different tools, and benefit from hardware acceleration. After reading this book, you will be able to choose and apply the correct patterns for building and deploying distributed machine learning systems; use common tooling such as TensorFlow (https://www.tensorflow.org), Kubernetes (https://kubernetes.io), Kubeflow (https://www.kubeflow.org), and Argo Workflows appropriately within a machine learning workflow; and gain practical experience in managing and automating machine learning tasks in Kubernetes. A comprehensive, hands-on project in chapter 9 provides an opportunity to build a real-life distributed machine learning system that uses many of the patterns we learn in the second part of the book. In addition, supplemental exercises at the end of some sections in the following chapters recap what we’ve learned.
1.1 Large-scale machine learning The scale of machine learning applications has become unprecedentedly large. Users are demanding faster responses to meet real-life requirements, and machine learning pipelines and model architectures are getting more complex. In this section, we’ll talk about the growing scale in more detail and what we can do to address the challenges.
1.1.1 The growing scale As the demand for machine learning grows, the complexity involved in building machine learning systems is increasing as well. Machine learning researchers and data analysts are no longer satisfied with building simple machine learning models on their laptops from gigabytes of data in Microsoft Excel sheets. Due to the growing demand and complexity, machine learning systems have to be built with the ability to handle the growing scale, including the increasing volume of historical data; frequent batches of incoming data; complex machine learning architectures; heavy model serving traffic; and complicated end-to-end machine learning pipelines.
Let’s consider two scenarios. First, imagine that you have a small machine learning model that has been trained on a small dataset (less than 1 GB). This approach might work well for your analysis at hand because you have a laptop with sufficient computational resources. But you realize that the dataset grows by 1 GB every hour, so the original model is no longer useful and predictive in real life. Suppose that you want to build a time-series model that predicts whether a component of a train will fail in the next hour to prevent failures and downtime. In this case, we have to build a machine learning model that uses the knowledge gained from the original data and the most recent data that arrives every hour to generate more accurate predictions. Unfortunately, your laptop has a fixed amount of computational resources and is no longer sufficient for building a new model that uses the entire dataset.
Second, suppose that you have successfully trained a model and developed a simple web application that uses the trained model to make predictions based on the user’s input. The web application may have worked well in the beginning, generating accurate predictions, and the user was quite happy with the results. This user’s friends heard about the good experience and decided to try it as well, so they sat in the same room and opened the website. Ironically, they started seeing longer delays when they tried to see the prediction results. The reason for the delays is that the single server used to run the web application can’t handle the increasing number of user requests as the application gets more popular. This scenario is a common challenge that many machine learning applications will encounter as they grow from beta products to popular applications. These applications need to be built on scalable machine learning system patterns to handle the growing scale of throughput.
1.1.2 What can we do? When the dataset is too large to fit in a single machine, as in the first scenario in section 1.1.1, how can we store the large dataset? Perhaps we can store different parts of the dataset on different machines and then train the machine learning model by sequentially looping through the various parts of the dataset on different machines.
If we have a 30 GB dataset like the one in figure 1.1, we can divide it into three partitions of 10 GB data, with each partition sitting on a separate machine that has enough disk storage. Then, we can consume the partitions one by one without having to train the machine learning model by using the entire dataset at the same time.
Figure 1.1 An example of dividing a large dataset into three partitions on three separate machines that have sufficient disk storage
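To make the idea concrete, here's a minimal sketch (in plain Python, with sizes expressed as hypothetical numbers of GB) of computing such partitions so that each chunk fits within one machine's storage:

```python
def partition_dataset(total_size_gb, capacity_gb):
    """Split a dataset of total_size_gb into chunks of at most capacity_gb.

    Returns a list of (start_gb, end_gb) ranges, one per machine.
    """
    partitions = []
    start = 0
    while start < total_size_gb:
        end = min(start + capacity_gb, total_size_gb)
        partitions.append((start, end))
        start = end
    return partitions

# The 30 GB dataset from figure 1.1 split across machines with 10 GB each
print(partition_dataset(30, 10))  # [(0, 10), (10, 20), (20, 30)]
```

Each range can then be assigned to a separate machine, and the model can consume the partitions one at a time.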
Then, we might ask what will happen if looping through different parts of the dataset is quite time-consuming. Assume that the dataset at hand has been divided into three partitions. As illustrated in figure 1.2, first, we initialize the machine learning model on the first machine, and then we train it, using all the data in the first data partition. Next, we transfer the trained model to the second machine, which continues training by using the second data partition. If each partition is large and time-consuming, we’ll spend a significant amount of time waiting.
Figure 1.2 An example of training the model sequentially on each data partition
In this case, we can think about adding workers. Each worker is responsible for consuming one of the data partitions, and all workers train the same model in parallel without waiting for the others. This approach is definitely good for speeding up the model training process. But what if some workers finish consuming their data partitions and want to update the model at the same time? Which worker's results (gradients) should we use to update the model first? Then, we must consider the conflicts and tradeoffs between performance and model quality. In figure 1.2, if the data partition that the first worker uses has better quality, due to a more rigorous data collection process, than the one the second worker uses, using the first worker's results first would produce a more accurate model. On the other hand, if the second worker has a smaller partition, it could finish training faster, so we could start using that worker's computational resources to train on a new data partition. When more workers are added, such as the three workers shown in figure 1.2, the conflicts in completion time for data consumption by different workers become even more obvious.
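The timing conflict can be illustrated with a toy sketch using Python threads. The sleep times (standing in for training work proportional to partition size) and the gradient values are illustrative assumptions, not a real training loop:

```python
import threading
import time

# Each worker "trains" on its own partition, then applies its gradient to the
# shared model without waiting for the others. The order of updates therefore
# depends on which worker happens to finish first.
model = {"weight": 0.0}
lock = threading.Lock()
update_order = []

def worker(worker_id, partition_size, gradient):
    time.sleep(partition_size * 0.01)  # smaller partitions finish sooner
    with lock:                         # serialize conflicting model updates
        model["weight"] += gradient
        update_order.append(worker_id)

threads = [
    threading.Thread(target=worker, args=(1, 5, 0.3)),  # larger partition
    threading.Thread(target=worker, args=(2, 1, 0.1)),  # smaller partition
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(update_order)       # worker 2 usually updates the model first
print(model["weight"])    # both gradients applied, regardless of order
```

The final weight is the same either way here, but in real asynchronous training the order in which gradients arrive does affect the resulting model, which is exactly the tradeoff described above.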
Similarly, if the application that uses the trained machine learning model to make predictions observes much heavier traffic, can we simply add servers, with each new server handling a certain percentage of the traffic? Unfortunately, the answer is not that simple. This naive solution would need to take other things into consideration, such as deciding the best load balancer strategy and processing duplicate requests in different servers.
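As one illustration of the considerations involved, the following sketch routes each request ID to a fixed server via hashing (one possible load-balancer strategy; the server names are hypothetical), which also makes duplicate requests easy to detect, because a duplicate always lands on the same server:

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c"]

def route(request_id):
    """Consistently map a request ID to one of the servers."""
    digest = hashlib.md5(request_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

seen = set()

def handle(request_id):
    server = route(request_id)
    if (server, request_id) in seen:   # duplicate request: skip reprocessing
        return server, "cached"
    seen.add((server, request_id))
    return server, "processed"

print(handle("req-42"))
print(handle("req-42"))  # the duplicate is routed to the same server
```

Real load balancers weigh many more factors (server health, current load, session affinity), which is why the naive "just add servers" answer isn't sufficient.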
We will learn more about handling these types of problems in the second part of the book. For now, the main takeaway is that we have established patterns and best practices to deal with certain situations, and we will use those patterns to make the most of our limited computational resources.
1.2 Distributed systems A single machine or laptop can’t satisfy the requirements for training a large machine learning model with a large amount of data. We need to write programs that can run on multiple machines and be accessed by people all over the world. In this section, we’ll talk about what a distributed system is and discuss one concrete example pattern that’s often used in distributed systems.
1.2.1 What is a distributed system? Computer programs have evolved from being able to run on only one machine to working with multiple machines. The increasing demand for computing power and the pursuit of higher efficiency, reliability, and scalability have boosted the advancement of large-scale data centers that consist of hundreds or thousands of computers communicating via the shared network, which have resulted in the development of distributed systems. A distributed system is one in which components are located on different networked computers and can communicate with one another to coordinate workloads and work together via message passing.
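As a toy illustration of coordination via message passing, the following sketch simulates two nodes whose "network" is a pair of in-memory queues; in a real distributed system, the queues would be sockets or an RPC layer, but the pattern of sending a request message and waiting for a reply is the same:

```python
import queue
import threading

# Each node has an inbox; putting a message in another node's inbox stands in
# for sending it over the network.
inbox_a, inbox_b = queue.Queue(), queue.Queue()

def node_b():
    msg = inbox_b.get()               # receive a task message from node A
    result = sum(msg["args"])         # perform the requested work
    inbox_a.put({"result": result})   # reply with the result

t = threading.Thread(target=node_b)
t.start()
inbox_b.put({"task": "sum", "args": [1, 2, 3]})  # node A sends a task
reply = inbox_a.get()                            # node A waits for the reply
t.join()
print(reply)  # {'result': 6}
```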
Figure 1.3 illustrates a small distributed system consisting of two machines communicating with each other via message passing. One machine contains two CPUs, and the other machine contains three CPUs. Obviously, a machine contains computational resources other than the CPUs; we use only CPUs here for illustration purposes. In real-world distributed systems, the number of machines can be extremely large—tens of thousands, depending on the use case. Machines with more computational resources can handle larger workloads and share the results with other machines.
Figure 1.3 An example of a small distributed system consisting of two machines with different amounts of computational resources communicating with each other via message passing
1.2.2 The complexity and patterns These distributed systems can run on multiple machines and be accessed by users all over the world. They are often complex and need to be designed carefully to be more reliable and scalable. Bad architectural considerations can lead to problems, often on a large scale, and result in unnecessary costs.
Lots of good patterns and reusable components are available for distributed systems. The work-queue pattern in a batch processing system, for example, makes sure that each piece of work is independent of the others and can be processed without any interventions within a certain amount of time. In addition, workers can be scaled up and down to ensure that the workload can be handled properly.
Figure 1.4 depicts seven work items, each of which might be an image that needs to be modified to grayscale by the system in the processing queue. Each of the three existing workers takes two to three work items from the processing queue, ensuring that no worker is idle to avoid waste of computational resources and maximizing the performance by processing multiple images at the same time. This performance is possible because each work item is independent of the others.
Figure 1.4 An example of a batch processing system using the work-queue pattern to modify images to grayscale
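A minimal sketch of the work-queue pattern might look like the following, where a string transformation stands in for the real grayscale conversion and the file names are placeholders:

```python
import queue
import threading

# Seven independent work items, as in figure 1.4
work_queue = queue.Queue()
for i in range(7):
    work_queue.put(f"image-{i}.png")

processed = []
lock = threading.Lock()

def worker():
    while True:
        try:
            item = work_queue.get_nowait()  # pull the next available item
        except queue.Empty:
            return                          # no work left; worker exits
        grayscale = item.replace(".png", "-gray.png")  # stand-in for real work
        with lock:
            processed.append(grayscale)

# Three workers drain the queue in parallel; none sits idle while work remains.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(processed))  # 7 -- every item handled exactly once
```

Because the items are independent, the number of workers can be scaled up or down freely without changing the correctness of the result.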
1.3 Distributed machine learning systems Distributed systems are useful not only for general computing tasks but also for machine learning applications. Imagine that we could use multiple machines with large amounts of computational resources in a distributed system to consume parts of the large dataset, store different partitions of a large machine learning model, and so on. Distributed systems can greatly speed up machine learning applications with scalability and reliability in mind. In this section, we’ll introduce distributed machine learning systems, present a few patterns that are often used in those systems, and talk about some real-life scenarios.
1.3.1 What is a distributed machine learning system? A distributed machine learning system is a distributed system consisting of a pipeline of steps and components that are responsible for different steps in machine learning applications, such as data ingestion, model training, and model serving. It uses patterns and best practices similar to those of a distributed system, as well as patterns designed specifically to benefit machine learning applications. Through careful design, a distributed machine learning system is more scalable and reliable for handling large-scale problems, such as large datasets, large models, heavy model serving traffic, and complicated model selection or architecture optimization.
1.3.2 Are there similar patterns? To handle the increasing demand for and scale of machine learning systems that will be deployed in real-life applications, we need to design the components in a distributed machine learning pipeline carefully. Design is often nontrivial, but using good patterns and best practices allows us to speed the development and deployment of machine learning models, use automations from different tools, and benefit from hardware accelerations.
There are similar patterns in distributed machine learning systems. As an example, multiple workers can be used to train the machine learning model asynchronously, with each worker being responsible for consuming certain partitions of the dataset. This approach, which is similar to the work-queue pattern used in distributed systems, can speed up the model training process significantly. Figure 1.5 illustrates how we can apply this pattern to distributed machine learning systems by replacing the work items with data partitions. Each worker takes some data partitions from the original data stored in a database and then uses them to train a centralized machine learning model.
Figure 1.5 An example of applying the work-queue pattern in distributed machine learning systems
Another example pattern, commonly used in machine learning systems rather than in general distributed systems, is the parameter server pattern for distributed model training. As shown in figure 1.6, the parameter servers are responsible for storing and updating a particular part of the trained model. Each worker node is responsible for taking a particular part of the dataset that will be used to update a certain part of the model parameters. This pattern is useful when the model is too large to fit on a single server; dedicated parameter servers store the model partitions without allocating unnecessary computational resources.
Figure 1.6 An example of applying the parameter server pattern in a distributed machine learning system
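A toy sketch of the idea, with illustrative parameter names and a static shard assignment (a real parameter server would shard many tensors across many machines), might look like this:

```python
# The model's parameters are sharded across two "servers" (plain dicts here).
param_servers = [{"w0": 0.0}, {"w1": 0.0}]
owner = {"w0": 0, "w1": 1}  # which server owns which parameter

def push_gradient(name, grad, lr=0.1):
    """A worker pushes a gradient to the server that owns the parameter."""
    server = param_servers[owner[name]]
    server[name] -= lr * grad  # SGD-style update applied on the shard

def pull(name):
    """A worker pulls the current value of a parameter from its shard."""
    return param_servers[owner[name]][name]

# Two workers each update only the part of the model their data touches.
push_gradient("w0", 2.0)   # worker 1's gradient
push_gradient("w1", -1.0)  # worker 2's gradient
print(pull("w0"), pull("w1"))  # -0.2 0.1
```

Because each server owns a disjoint shard, the two updates never conflict, and the full model never has to fit on any single machine.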
Part 2 of this book illustrates patterns like these. For now, keep in mind that some patterns in distributed machine learning systems also appear in general-purpose distributed systems, while others are designed specifically to handle machine learning workloads at large scale.
1.3.3 When should we use a distributed machine learning system? If the dataset is too large to fit on our local laptops, as illustrated in figures 1.1 and 1.2, we can use patterns such as data partitioning or introduce additional workers to speed up model training. We should start thinking about designing a distributed machine learning system when any of the following scenarios occurs:
- The model is large, consisting of millions of parameters that a single machine cannot store and that must be partitioned on different machines.
- The machine learning application needs to handle increasing amounts of heavy traffic that a single server can no longer manage.
- The task at hand involves many parts of the model's life cycle, such as data ingestion, model serving, data/model versioning, and performance monitoring.
- We want to use many computing resources for acceleration, such as dozens of servers that have many GPUs each.
If any of these scenarios occur, it's usually a sign that a well-designed distributed machine learning system will be needed in the near future.
1.3.4 When should we not use a distributed machine learning system? Although a distributed machine learning system is helpful in many situations, it is usually harder to design and requires experience to operate efficiently. Additional overhead and tradeoffs are involved in developing and maintaining such a complicated system. If you encounter any of the following cases, stick with a simple approach that already works well:
- The dataset is small, such as a CSV file smaller than 10 GB.
- The model is simple and doesn't require heavy computation, such as linear regression.
- Computing resources are limited but sufficient for the tasks at hand.
1.4 What we will learn in this book In this book, we'll learn to choose and apply the correct patterns for building and deploying distributed machine learning systems to gain practical experience in managing and automating machine learning tasks. We'll use several popular frameworks and cutting-edge technologies to build components of a distributed machine learning workflow, including the following:
- TensorFlow (https://www.tensorflow.org)
- Kubernetes (https://kubernetes.io)
- Kubeflow (https://www.kubeflow.org)
- Docker (https://www.docker.com)
- Argo Workflows (https://argoproj.github.io/workflows/)
A comprehensive hands-on project in the last part of the book consists of an end-to-end distributed machine learning pipeline system. Figure 1.7 is the architecture diagram of the system that we will be building. We will gain hands-on experience implementing many of the patterns covered in the following chapters. Handling large-scale problems and taking what we've developed on our personal laptops to large distributed clusters should be exciting.
We’ll be using TensorFlow with Python to build machine learning and deep learning models for various tasks, such as building useful features based on a real-life dataset, training predictive models, and making real-time predictions. We’ll also use Kubeflow to run distributed machine learning tasks in a Kubernetes cluster. Furthermore, we will use Argo Workflows to build a machine learning pipeline that consists of many important components of a distributed machine learning system. The basics of these technologies are introduced in chapter 2, and we’ll gain hands-on experience with them in part 2. Table 1.1 shows the key technologies that will be covered in this book and example uses.
Figure 1.7 An architecture diagram of the end-to-end machine learning system that we will be building in the last part of the book
Table 1.1 The technologies covered in this book and their uses
- TensorFlow: Building machine learning and deep learning models
- Kubernetes: Managing distributed environments and resources
- Kubeflow: Submitting and managing distributed training jobs easily on Kubernetes clusters
- Argo Workflows: Defining, orchestrating, and managing workflows
- Docker: Building and managing images used for starting containerized environments
Before we dive into details in chapter 2, I recommend that readers have basic knowledge of and hands-on experience in building machine learning models in Python. Although experience in production environments or distributed systems is not a requirement, I expect readers in this position to have at least some exposure to machine learning applications running in production and to have written Python and Bash scripts for at least one year. In addition, understanding the basics of Docker and being able to manage images/containers by using the Docker command-line interface is required. Familiarity with basic YAML syntax is helpful but not required; the syntax is intuitive and should be easy to pick up along the way. If most of these topics are new to you, I suggest that you learn more about them from other resources before reading further.
Summary
- Machine learning systems deployed in real-life applications usually need to handle the growing scale of larger datasets and heavier model serving traffic.
- It's nontrivial to design large-scale distributed machine learning systems.
- A distributed machine learning system is usually a pipeline of many components, such as data ingestion, model training, serving, and monitoring.
- Using good patterns to design the components of a machine learning system can speed up the development and deployment of machine learning models, enable the use of automations from different tools, and benefit from hardware acceleration.