Chapter 5 Workflow Patterns
We want to build a machine learning system to train different models. We then want to use the top two models to generate predictions so that the entire system is less likely to miss any entities in the videos since the two models may capture information from different perspectives.
Problem Description
We want to build a machine learning workflow that trains different models after the system has ingested data from the data source. Then, we want to select the top two models and use the knowledge from both to provide model serving that generates predictions for users. Building a workflow that covers the normal end-to-end process of a machine learning system with only data ingestion, model training, and model serving, where each component appears only once as an individual step in the workflow, is pretty straightforward. However, in our particular scenario, the workflow is much more complex, as we need to include multiple model training steps as well as multiple model serving steps. How do we formalize and generalize the structure of this complex workflow so that it can be easily packaged, reused, and distributed?
Solution
Let’s start with the most basic machine learning workflow that includes only data ingestion, model training, and model serving, where each of these components only appears once as an individual step in the workflow. We will build our system based on this workflow to serve as our baseline, as shown in figure 5.7.
Figure 5.7 A baseline workflow including only data ingestion, model training, and model serving, where each of these components only appears once as an individual step in the workflow
Our goal is to represent the machine learning workflow that builds and selects the top two best-performing models that will be used for model serving to give better inference results. Let’s take a moment to understand why this approach might be used in practice. For example, figure 5.8 shows two models: the first model has knowledge of four entities, and the second model has knowledge of three entities. Thus, each can tag the entities it knows from the videos. We can use both models to tag entities at the same time and then aggregate their results. The aggregated result is obviously more knowledgeable and is able to cover more entities. In other words, two models can be more effective and produce more comprehensive entity-tagging results.
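The union-style aggregation described above can be sketched in a few lines of Python. The entity names below are made up for illustration; each model tags only the entities it has knowledge of, and the union of their results covers more entities than either model alone:

```python
# Sketch of aggregating entity tags from two models (hypothetical entities).
def aggregate_tags(*model_results: set) -> set:
    """Combine entity tags from multiple models into one result."""
    combined = set()
    for result in model_results:
        combined |= result
    return combined

# Model 1 knows four entities; model 2 knows three, partially overlapping.
model_1_tags = {"person", "car", "dog", "tree"}
model_2_tags = {"person", "bicycle", "cat"}

aggregated = aggregate_tags(model_1_tags, model_2_tags)
print(sorted(aggregated))  # six distinct entities, more than either model alone
```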
Figure 5.8 A diagram of models where the first model has knowledge of four entities and the second model has knowledge of three entities. Thus, each can tag the entities it knows from the videos. We can use both models to tag entities at the same time and then aggregate their results. The aggregated result covers more entities than each individual model.
Now that we understand the motivation behind building this complex workflow, let's look at an overview of the entire end-to-end workflow process. We want to build a machine learning workflow that performs the following functions sequentially:

1. Ingests data from the same data source
2. Trains multiple different models, using either different sets of hyperparameters of the same model architecture or various model architectures
3. Picks the two top-performing models among the trained models to be used for model serving
4. Aggregates the results of the two model serving systems to present to users
Let’s first add some placeholders to the baseline workflow for multiple model training steps after data ingestion. We can then add multiple model serving steps once the multiple model training steps finish. A diagram of the enhanced baseline workflow is shown in figure 5.9. The key difference from what we’ve dealt with before in the baseline is the presence of multiple model training and model serving components. The steps do not have direct, one-to-one relationships. For example, each model training step may be connected to a single model serving step or not connected to any steps at all.
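One way to write down the enhanced baseline is as a step-dependency mapping, where each step lists the upstream steps it depends on. The step names here are illustrative; note that the third training step is deliberately not connected to any serving step, matching the observation that a training step may not connect to anything downstream:

```python
# Step name -> list of upstream steps it depends on (illustrative names).
workflow = {
    "data_ingestion": [],
    "model_training_1": ["data_ingestion"],
    "model_training_2": ["data_ingestion"],
    "model_training_3": ["data_ingestion"],  # not connected to any serving step
    "model_serving_1": ["model_training_1"],
    "model_serving_2": ["model_training_2"],
}

def downstream(workflow, step):
    """Steps that directly depend on `step`."""
    return [s for s, deps in workflow.items() if step in deps]

print(downstream(workflow, "data_ingestion"))  # the three training steps
print(downstream(workflow, "model_training_3"))  # empty: no serving step
```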
Figure 5.9 A diagram of the enhanced baseline workflow where multiple model training steps occur after data ingestion, followed by multiple model serving steps
Figure 5.10 shows that the models trained from the first two model training steps outperform the model trained from the third model training step. Thus, only the first two model training steps are connected to the model serving steps.
We can compose this workflow as follows. On successful data ingestion, multiple model training steps are connected to the data ingestion step so that they can use the shared data that's ingested and cleaned from the original data source. Next, a single model selection step is connected to the model training steps to pick the top two performing models. It is followed by two model serving steps that use the selected models to handle model serving requests from users. A final step at the end of this machine learning workflow is connected to the two model serving steps to aggregate the model inference results that will be presented to the users.
A diagram of the complete workflow is shown in figure 5.11. This workflow trains different models via three model training steps, resulting in varying accuracy when tagging entities. A model selection step picks the top two models with at least 90% accuracy, trained from the first two model training steps, that will be used in the following two separate model serving steps. The results from the two model serving steps are then aggregated to present to users via a result aggregation step.
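The composition just described can be sketched end to end with placeholder functions. All names, the toy dataset, and the accuracy numbers below are made up for illustration; each stage is a stand-in for the corresponding workflow step:

```python
# End-to-end sketch: ingestion -> training -> selection -> serving -> aggregation.
def ingest_data():
    return ["video_1", "video_2", "video_3"]  # placeholder dataset

def train_models(data, n_models=3):
    # Pretend each model reaches a different accuracy on the same data.
    return [{"name": f"model_{i}", "accuracy": 0.80 + 0.05 * i} for i in range(n_models)]

def select_top_two(models):
    return sorted(models, key=lambda m: m["accuracy"], reverse=True)[:2]

def serve(model, data):
    # Stand-in for a serving step: tag each sample with the model's name.
    return {f"{model['name']}:{sample}" for sample in data}

def aggregate(results_a, results_b):
    return results_a | results_b  # result aggregation step

data = ingest_data()
top_two = select_top_two(train_models(data))
final = aggregate(serve(top_two[0], data), serve(top_two[1], data))
print(len(final))  # predictions from both serving steps combined
```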
Three model training steps train different models that arrive at different accuracies when tagging entities.
This step picks the top two models that will be used in the following two separate model serving steps.
The results from the two model serving steps are then aggregated via a result aggregation step to present to users.
Figure 5.11 A machine learning workflow that trains different models that result in varying accuracy when tagging entities and then selects the top two models with at least 90% accuracy to be used for model serving. The results from the two model serving steps are then aggregated to present to users.
We can abstract out two patterns from this complex workflow. The first one we observe is the fan-out pattern. Fan-out describes the process of starting multiple separate steps to handle input from the workflow. In our workflow, the fan-out pattern appears when multiple separate model training steps connect to the data ingestion step, as shown in figure 5.12. There’s also the fan-in pattern in our workflow, where we have one single aggregation step that combines the results from the two model serving steps, as shown in figure 5.13. Fan-in describes the process of combining results from multiple steps into one step.
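Both patterns can be demonstrated with a thread pool, which is a minimal sketch rather than a real workflow engine. One input fans out to three independent "training" tasks running concurrently, and their results fan back in to a single selection step; the `train` function below is a stand-in that just fabricates an accuracy from its config:

```python
# Fan-out / fan-in sketch using a thread pool (placeholder training function).
from concurrent.futures import ThreadPoolExecutor

def train(config):
    # Placeholder: derive a fake accuracy from the config instead of training.
    return {"config": config, "accuracy": 0.80 + 0.01 * config}

def fan_out_fan_in(configs):
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        results = list(pool.map(train, configs))      # fan-out: run concurrently
    best = max(results, key=lambda r: r["accuracy"])  # fan-in: combine into one
    return best

print(fan_out_fan_in([1, 2, 3])["config"])
```

Because the tasks are independent, the order in which they finish does not matter; `pool.map` returns results in submission order regardless of completion order, which is what makes the fan-in step safe.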
Fanning out to three separate model training steps from one data ingestion step.
Figure 5.12 A diagram of the fan-out pattern that appears when multiple separate model training steps are connected to the data ingestion step
Fanning in from two model serving steps to one result aggregation step.
Figure 5.13 A diagram of the fan-in pattern, where we have one single aggregation step that combines the results from the two model serving steps
Formalizing these patterns helps us build and organize more complex workflows, applying different patterns based on real-world requirements. We have successfully built the system as a complex workflow that trains different models and then uses the top two models to generate predictions so that the entire system is less likely to miss any entities in the videos. These patterns are powerful when constructing complex workflows to meet real-world requirements. We can construct various workflows, such as one that fans out from a single data processing step to multiple model training steps to train different models with the same dataset. We can also start more than one model serving step from each of these model training steps if the predictions from different models are useful in real-world applications. We'll apply this pattern in section 9.4.1.
Discussion
By using the fan-in and fan-out patterns in the system, the system is now able to execute complex workflows that train multiple machine learning models and pick the most performant ones to provide good entity-tagging results in the model serving system. These patterns are great abstractions that can be incorporated into very complex workflows to meet the increasing demand for complex distributed machine learning workflows in the real world. But what kind of workflows are suitable for the fan-in and fan-out patterns? In general, if both of the following apply, we can consider incorporating these patterns:
- The multiple steps that we are fanning in or fanning out are independent of each other.
- It takes a long time for these steps to run sequentially.

The multiple steps need to be order-independent because we don't know the order in which concurrent copies of those steps will run or the order in which they will return. For example, if the workflow also contains a step that trains an ensemble of other models (also known as ensemble learning; http://mng.bz/N2vn) to provide a better-aggregated model, this ensemble model depends on the completion of the other model training steps. Consequently, we cannot use the fan-in pattern, because the ensemble model training step needs to wait for the other model training steps to complete before it can start running, which would require some extra waiting and delay the entire workflow.
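The contrast can be made concrete with a small sketch: the base training steps are order-independent and safe to fan out concurrently, while a hypothetical ensemble step acts as a barrier that can only run after every base step has finished. All functions and numbers here are placeholders:

```python
# Independent base steps fan out concurrently; the ensemble step must wait
# for all of them, so it runs sequentially after the fan-out completes.
from concurrent.futures import ThreadPoolExecutor

def train_base(i):
    return {"model": i, "accuracy": 0.85 + 0.01 * i}  # placeholder training

def train_ensemble(base_models):
    # Depends on *all* base models being finished before it can start.
    return {"model": "ensemble", "bases": [m["model"] for m in base_models]}

with ThreadPoolExecutor() as pool:
    bases = list(pool.map(train_base, range(3)))  # order-independent: fan out

ensemble = train_ensemble(bases)  # barrier: runs only after the fan-out
print(ensemble["bases"])
```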
Ensemble models

An ensemble model uses multiple machine learning models to obtain better predictive performance than could be obtained from any of the constituent models alone. It often consists of a number of alternative models that can learn the relationships in the dataset from different perspectives. Ensemble models tend to yield better results when diversity among the constituent models is significant. Therefore, many ensemble approaches try to increase the diversity of the models they combine.
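A toy averaging ensemble illustrates the idea; the three "models" below are simple stand-in functions rather than trained models, and averaging is just one of many ways to combine constituent predictions:

```python
# Toy ensemble: average the predictions of diverse constituent models.
def model_a(x):
    return x * 1.1  # stand-in model that slightly over-predicts

def model_b(x):
    return x * 0.9  # stand-in model that slightly under-predicts

def model_c(x):
    return x + 0.5  # stand-in model with an additive bias

def ensemble_predict(models, x):
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)  # combine the diverse predictions

print(ensemble_predict([model_a, model_b, model_c], 10.0))
```

Because the constituent models err in different directions, their average tends to sit closer to the true value than some of the individual predictions, which is the intuition behind preferring diverse constituents.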
The fan-in and fan-out patterns can create very complex workflows that meet most of the requirements of machine learning systems. However, to achieve good performance on those complex workflows, we need to determine which parts of the workflows to run first and which parts of the workflows can be executed in parallel. As a result of the optimization, data science teams would spend less time waiting for workflows to complete, thus reducing infrastructure costs. I will introduce some patterns to help us organize the steps in the workflow from a computational perspective in the next section.