preface

In recent years, machine learning has made tremendous progress, yet large-scale machine learning remains challenging. Take model training as an example. With the variety of machine learning frameworks such as TensorFlow, PyTorch, and XGBoost, it’s not easy to automate the process of training machine learning models on distributed Kubernetes clusters. Different models require different distributed training strategies, such as parameter servers and collective communication strategies that leverage the network structure. In a real-world machine learning system, many other essential components, such as data ingestion, model serving, and workflow orchestration, must be designed carefully to make the system scalable, efficient, and portable. Machine learning researchers with little or no DevOps experience cannot easily launch and manage distributed training tasks.

Many books have been written on either machine learning or distributed systems. However, there is currently no book that covers the combination of the two and bridges the gap between them. This book introduces many patterns and best practices for large-scale machine learning systems in distributed environments. It also includes a hands-on project that builds an end-to-end distributed machine learning system incorporating many of the patterns we cover. We will use several state-of-the-art technologies to implement the system, including Kubernetes, Kubeflow, TensorFlow, and Argo. These technologies are popular choices when building a distributed machine learning system from scratch in a cloud-native way, making the system scalable and portable.

I’ve worked in this area for years, including maintaining some of the open source tools used in this book and leading teams that provide scalable machine learning infrastructure. In my daily work, I always consider these patterns and their tradeoffs when designing systems from scratch or improving existing systems. I hope this book will be helpful to you as well!
acknowledgments

First and foremost, I want to thank my wife, Wenxuan. You’ve always supported me, always patiently listened while I struggled to get this book done, always made me believe I could finish this project, and helped take care of the kids while I was working on the book. Thanks to my three lovely kids, who brought smiles to my face whenever I got stuck. I love you all.

Next, I’d like to acknowledge Patrick Barb, my previous development editor, for your patience and guidance over the years. I also thank Michael Stephens for guiding the direction of this book and helping me get through the tough times when I doubted myself. Thanks also to Karen Miller and Malena Selic for providing a smooth transition and helping me move quickly to the production stage. Your commitment to the quality of this book has made it better for everyone who reads it. Thanks as well to all the other folks at Manning who worked with me on the production and promotion of the book. It was truly a team effort.

Thanks also to my technical editor, Gerald Kuch, who brought over 30 years of industry experience from several large companies as well as startups and research labs. Gerald’s knowledge and teaching experience covering data structures and algorithms, functional programming, concurrent programming, distributed systems, big data, data engineering, and data science made him an excellent resource for me as the manuscript was developed.

Finally, I’d also like to thank the reviewers who took the time to read my manuscript at various stages during its development and provided invaluable feedback. To Al Krinker, Aldo Salzberg, Alexey Vyskubov, Amaresh Rajasekharan, Bojan Tunguz, Cass Petrus, Christopher Kottmyer, Chunxu Tang, David Yakobovitch, Deepika Fernandez, Helder C. R. Oliveira, Hongliang Liu, James Lamb, Jiri Pik, Joel Holmes, Joseph Wang, Keith Kim, Lawrence Nderu, Levi McClenny, Mary Anne Thygesen, Matt Welke, Matthew Sarmiento, Michael Aydinbas, Michael Kareev, Mikael Dautrey, Mingjie Tang, Oleksandr Lapshyn, Pablo Roccatagliata, Pierluigi Riti, Prithvi Maddi, Richard Vaughan, Simon Verhoeven, Sruti Shivakumar, Sumit Pal, Vidhya Vinay, Vladimir Pasman, and Wei Yan: your suggestions helped me improve this book.
about this book
Distributed Machine Learning Patterns is filled with practical patterns for running machine learning systems on distributed Kubernetes clusters in the cloud. Each pattern is designed to help solve common challenges faced when building distributed machine learning systems, including supporting distributed model training, handling unexpected failures, and serving models under dynamic traffic. Real-world scenarios provide clear examples of how to apply each pattern, alongside the potential tradeoffs for each approach. Once you’ve mastered these cutting-edge techniques, you’ll put them all into practice and finish up by building a comprehensive distributed machine learning system.
Who should read this book?

Distributed Machine Learning Patterns is for data analysts, data scientists, and software engineers familiar with the basics of machine learning algorithms and running machine learning in production. Readers should be familiar with the basics of Bash, Python, and Docker.
How this book is organized: A roadmap

The book has three parts that cover nine chapters.

Part 1 provides some background and concepts around distributed machine learning systems. We will discuss the growing scale of machine learning applications and the complexity of distributed systems, and we will introduce a couple of patterns often seen in both distributed systems and distributed machine learning systems.
Part 2 presents some of the challenges involved in various components of a machine learning system and introduces a few established patterns adopted heavily in industry to address those challenges:

Chapter 2 introduces data ingestion patterns, including batching, sharding, and caching, to efficiently process large datasets.
Chapter 3 covers three patterns often seen in distributed model training, involving parameter servers, collective communication, and elasticity and fault tolerance.
Chapter 4 demonstrates how useful replicated services, sharded services, and event-driven processing can be for model serving.
Chapter 5 describes several workflow patterns, including fan-in and fan-out patterns, synchronous and asynchronous patterns, and step memoization patterns, which are commonly used to create complex, distributed machine learning workflows.
Chapter 6 ends this part with scheduling and metadata patterns that can be useful for operations.

Part 3 goes deep into an end-to-end machine learning system that applies what we learned previously. Readers will gain hands-on experience implementing many of the previously learned patterns in this project:

Chapter 7 goes through the project background and system components.
Chapter 8 covers the fundamentals of the technologies we will use for the project.
Chapter 9 ends the book with a complete implementation of an end-to-end machine learning system.

In general, readers who already know what a distributed machine learning system is can skip Part 1. All chapters in Part 2 can be read independently, since each covers a different aspect of distributed machine learning systems. Chapters 7 and 8 are prerequisites for the project we build in chapter 9. Chapter 8 can be skipped if readers are already familiar with the technologies.
About the code

You can get executable snippets of code from the liveBook (online) version of this book at https://livebook.manning.com/book/distributed-machine-learning-patterns. The complete code for the examples in the book is available for download from the Manning website at www.manning.com and from the GitHub repo at https://github.com/terrytangyuan/distributed-ml-patterns. Please submit any issues to the GitHub repo, which will be actively watched and maintained.
liveBook discussion forum

Purchase of Distributed Machine Learning Patterns includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/distributed-machine-learning-patterns/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website for as long as the book is in print.
about the author

YUAN TANG is a founding engineer at Akuity, building an enterprise-ready platform for developers. He has previously led data science and engineering teams at Alibaba and Uptake, focusing on AI infrastructure and AutoML platforms. He’s a project lead of Argo and Kubeflow, a maintainer of TensorFlow and XGBoost, and the author of numerous open source projects. In addition, Yuan has authored three machine learning books and several publications. He’s a regular speaker at various conferences and a technical advisor, leader, and mentor at several organizations.
about the cover illustration

The figure on the cover of Distributed Machine Learning Patterns is “Homme Corfiote,” or “Man from Corfu,” taken from a collection by Jacques Grasset de Saint-Sauveur, published in 1797. Each illustration is finely drawn and colored by hand. In those days, it was easy to identify where people lived and what their trade or station in life was just by their dress. Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional culture centuries ago, brought back to life by pictures from collections such as this one.