Posted by Hao Liang's Blog on Monday, January 1, 0001

Deploying a machine learning application on a modern distributed system puts the spotlight on reliability, performance, security, and other operational concerns. In this in-depth guide, Yuan Tang, project lead of Argo and Kubeflow, shares patterns, examples, and hard-won insights on taking an ML model from a single device to a distributed cluster.

Distributed Machine Learning Patterns provides dozens of techniques for designing and deploying distributed machine learning systems. In it, you’ll learn patterns for distributed model training, managing unexpected failures, and dynamic model serving. You’ll appreciate the practical examples that accompany each pattern along with a full-scale project that implements distributed model training and inference with autoscaling on Kubernetes.

What’s Inside
● Data ingestion, distributed training, model serving, and more
● Automating Kubernetes and TensorFlow with Kubeflow and Argo Workflows
● Managing and monitoring workloads at scale

For data analysts and engineers familiar with the basics of machine learning, Bash, Python, and Docker.

Yuan Tang is a project lead of Argo and Kubeflow, maintainer of TensorFlow and XGBoost, and author of numerous open source projects. Gerald Kuch was the technical editor for this book.

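When machine learning applications are deployed on modern distributed systems, attention shifts to reliability, performance, security, and the operational challenges that come with addressing them. In this in-depth guide, Yuan Tang, project lead of Argo and Kubeflow, shares patterns, examples, and hard-won experience in moving machine learning models from a single-machine environment to a complex distributed cluster.

Distributed Machine Learning Patterns covers in detail dozens of techniques for designing and deploying distributed machine learning systems. You will use these patterns to work out how to train models in a distributed fashion, how to handle unexpected system failures, and how to deploy dynamic model serving. Each pattern comes with practical case studies, and the book includes a complete project that implements distributed model training and elastic inference on Kubernetes.

What's inside:
● Data ingestion, distributed training, model serving, and related concepts
● Automating TensorFlow deployment on Kubernetes with Kubeflow and Argo Workflows
● Managing and monitoring large-scale machine learning workloads

For data analysts and engineers familiar with the basics of machine learning, Bash, Python, and Docker.

Yuan Tang is a project lead of Argo and Kubeflow, a core maintainer of the TensorFlow and XGBoost projects, and the author of numerous open source projects. Gerald Kuch is the technical editor of this book.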

“Approachable for beginners and inspirational for experienced practitioners. As soon as I finished reading, I was ready to start building.” —James Lamb, SpotHero

“Exceptionally timely and comprehensive. Its pattern perspective, accompanied by real-world examples and widely adopted systems like Kubernetes, Kubeflow, and Argo, truly set it apart.” —Yuan Chen, Apple

“An amazing guide to designing resilient and scalable ML systems for both training and serving models.” —Ryan Russon, Capital One

“A wonderful book! Machine learning at scale explained clearly and from first principles!” —Laurence Moroney, Google

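“A very friendly introduction for beginners, and inspiring for experienced practitioners as well. After finishing this book, I was ready to start building things myself.” —James Lamb, SpotHero

“A very timely and comprehensive book. Its pattern perspective, combined with practical examples and widely used systems such as Kubernetes, Kubeflow, and Argo, truly sets it apart.” —Yuan Chen, Apple

“A remarkable guide! It clearly explains the design of resilient, scalable, large-scale machine learning systems, covering both model training and serving.” —Ryan Russon, Capital One

“A wonderful book! Starting from first principles, it explains machine learning at scale clearly!” —Laurence Moroney, Google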