Overview of relevant technologies
This chapter covers
- Getting familiar with model building using TensorFlow
- Understanding key terminologies on Kubernetes
- Running distributed machine learning workloads with Kubeflow
- Deploying container-native workflows using Argo Workflows
In the previous chapter, we went through the project background and system components to understand our strategies for implementing each component. We also discussed the challenges related to each component and discussed the patterns we will apply to address them. As previously mentioned, we will dive into the project’s implementation details in chapter 9, the book’s last chapter. However, since we will use different technologies in the project and it’s not easy to cover all the basics on the fly, in this chapter, you will learn the basic concepts of the four technologies (TensorFlow, Kubernetes, Kubeflow, and Argo Workflows) and gain hands-on experience. Each of these four technologies has a different purpose, but all will be used to implement the final project in chapter 9. TensorFlow will be used for data processing, model building, and evaluation. We will use Kubernetes as our core distributed infrastructure. On top of that, Kubeflow will be used for submitting distributed model training jobs to the Kubernetes cluster, and Argo Workflows will be used to construct and submit the end-to-end machine learning workflows.
8.1 TensorFlow: The machine learning framework

TensorFlow is an end-to-end machine learning platform. It has been widely adopted in academia and industry for different applications and use cases, such as image classification, recommendation systems, and natural language processing. TensorFlow is highly portable, deployable on different hardware, and has multilanguage support. TensorFlow has a large ecosystem. The following are some highlighted projects in this ecosystem:

- TensorFlow.js is a library for machine learning in JavaScript. Users can use machine learning directly in the browser or in Node.js.
- TensorFlow Lite is a mobile library for deploying models on mobile devices, microcontrollers, and other edge devices.
- TFX is an end-to-end platform for deploying production machine learning pipelines.
- TensorFlow Serving is a flexible, high-performance serving system for machine learning models designed for production environments.
- TensorFlow Hub is a repository of trained machine learning models ready for fine-tuning and deployable anywhere. You can reuse trained models like BERT and Faster R-CNN with just a few lines of code.

More can be found in the TensorFlow GitHub organization (https://github.com/tensorflow). We will use TensorFlow Serving in our model serving component. In the next section, we’ll walk through some basic examples in TensorFlow to train a machine learning model locally using the MNIST dataset.
8.1.1 The basics

Let’s first install Anaconda for Python 3 for the basic examples we will use. Anaconda (https://www.anaconda.com) is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. The distribution includes data-science packages suitable for Windows, Linux, and macOS. Once Anaconda is installed, use the following command in your console to create a Conda environment with Python 3.9.
Listing 8.1 Creating a Conda environment
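A minimal sketch of this step; the environment name dist-ml is an assumption, so substitute whatever name you prefer:

```bash
# Create a Conda environment with Python 3.9.
conda create --name dist-ml python=3.9
```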
Next, we can activate this environment with the following code.
Listing 8.2 Activating a Conda environment
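Assuming the environment name used above:

```bash
conda activate dist-ml
```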
Then, we can install TensorFlow in this Python environment.
Listing 8.3 Installing TensorFlow
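A sketch that installs the latest released TensorFlow package; the book may pin a specific version:

```bash
pip install tensorflow
```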
If you encounter any problems, please refer to the installation guide (https://www.tensorflow.org/install). In some cases, you may need to uninstall your existing NumPy and reinstall it.
Listing 8.4 Installing NumPy
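A sketch of the NumPy reinstall, in case of version conflicts:

```bash
pip uninstall -y numpy
pip install numpy
```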
If you are on Mac, check out the Metal plugin for acceleration (https://developer.apple.com/metal/tensorflow-plugin/). Once we’ve successfully installed TensorFlow, we can start with a basic image classification example! Let’s first load and preprocess our simple MNIST dataset. Recall that the MNIST dataset contains images for handwritten digits from 0 to 9. Each row represents images for a particular handwritten digit, as shown in figure 8.1.
Figure 8.1 Some example images for handwritten digits from 0 to 9 where each row represents images for a particular handwritten digit
For example, the first row contains images of the digit 0.
Keras API (tf.keras) is a high-level API for model training in TensorFlow, and we will use it for both loading the built-in datasets and model training and evaluation.
Listing 8.5 Loading the MNIST dataset
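A sketch of loading the dataset through the Keras API:

```python
import tensorflow as tf

# Download (or load from the default cache path) the MNIST dataset and
# split it into training and testing images and labels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
```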
The function load_data() uses a default path to save the MNIST dataset if we don’t specify a path. This function returns NumPy arrays for training and testing images and labels. We split the dataset into training and testing so we can run both model training and evaluation in our example.
A NumPy array is a common data type in Python’s scientific computing ecosystem. It describes multidimensional arrays and has three properties: data, shape, and dtype. Let’s use our training images as an example.
Listing 8.6 Inspecting the dataset
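A sketch of inspecting the three properties of the training images array:

```python
print(x_train.shape)  # (60000, 28, 28)
print(x_train.dtype)  # uint8
```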
x_train is a 60,000 × 28 × 28 three-dimensional array. The data type is uint8, with values ranging from 0 to 255. In other words, this object contains 60,000 grayscale images with a resolution of 28 × 28.
Next, we can perform some feature preprocessing on our raw images. Since many algorithms and models are sensitive to the scale of the features, we often center and scale features into a range such as [0, 1] or [-1, 1]. In our case, we can do this easily by dividing the images by 255.
Listing 8.7 The preprocessing function
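A minimal sketch of the scaling step described above:

```python
# Scale pixel values from [0, 255] to [0, 1].
x_train, x_test = x_train / 255.0, x_test / 255.0
```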
After preprocessing the images in the training and testing sets, we can instantiate a simple multilayer neural network model. We use tf.keras to define the model architecture. First, we use Flatten to expand the two-dimensional images into a one-dimensional array by specifying the input shape as 28 × 28. The second layer is densely connected and uses the 'relu' activation function to introduce some nonlinearity. The third layer is a dropout layer to reduce overfitting and make the model more generalizable. Since the handwritten digits consist of 10 different digits from 0 to 9, our last layer is densely connected for 10-class classification with softmax activation.
Listing 8.8 The sequential model definition
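A sketch of the architecture just described; the 128 units in the hidden layer and the 0.2 dropout rate are illustrative assumptions:

```python
model = tf.keras.models.Sequential([
    # Flatten each 28 x 28 image into a 784-element vector.
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # Densely connected layer with ReLU nonlinearity.
    tf.keras.layers.Dense(128, activation='relu'),
    # Dropout layer to reduce overfitting.
    tf.keras.layers.Dropout(0.2),
    # 10-class output with softmax activation.
    tf.keras.layers.Dense(10, activation='softmax'),
])
```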
After we’ve defined the model architecture, we need to specify three different components: the evaluation metric, loss function, and optimizer.
Listing 8.9 Model compilation with the optimizer, loss function, and evaluation metric
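A sketch of the compilation step; sparse categorical cross-entropy fits the integer labels returned by load_data():

```python
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
```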
We can then start our model training with five epochs as well as evaluation via the following.
Listing 8.10 Model training using the training data
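A sketch of training for five epochs followed by evaluation on the testing data:

```python
# Train for five epochs on the training images and labels.
model.fit(x_train, y_train, epochs=5)

# Evaluate the trained model on the held-out testing data.
model.evaluate(x_test, y_test, verbose=2)
```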
We should see the training progress in the log as each epoch completes, followed by the log from the model evaluation.
We should observe that as the loss decreases during training, the accuracy increases to 97.8% on training data. The final trained model has an accuracy of 97.6% on the testing data. Your result might be slightly different due to the randomness in the modeling process.
After we’ve trained the model and are happy with its performance, we can save it using the following code so that we don’t have to retrain it from scratch next time.
Listing 8.11 Saving the trained model
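A one-line sketch of saving the model in HDF5 format:

```python
model.save('my_model.h5')
```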
This code saves the model as file my_model.h5 in the current working directory. When we start a new Python session, we can import TensorFlow and load the model object from the my_model.h5 file.
Listing 8.12 Loading the saved model
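A sketch of loading the saved model in a fresh session:

```python
import tensorflow as tf

model = tf.keras.models.load_model('my_model.h5')
```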
We’ve learned how to train a model using TensorFlow’s Keras API for a single set of hyperparameters. These hyperparameters remain constant over the training process and directly affect the performance of your machine learning program. Let’s learn how to tune hyperparameters for your TensorFlow program with Keras Tuner (https://keras.io/keras_tuner/). First, install the Keras Tuner library.
Listing 8.13 Installing the Keras Tuner package
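The package is published on PyPI as keras-tuner:

```bash
pip install keras-tuner
```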
Once it’s installed, you should be able to import all the required libraries.
Listing 8.14 Importing necessary packages
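A sketch of the imports; the kt alias matches the usage in the answers at the end of the chapter:

```python
import tensorflow as tf
import keras_tuner as kt
```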
We will use the same MNIST dataset and the preprocessing functions for our hyperparameter tuning example. We then wrap our model definition into a Python function.
Listing 8.15 The model building function using TensorFlow and Keras Tuner
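A sketch of such a model-building function; the hyperparameter names 'units' and 'learning_rate' and the step size of 32 are assumptions consistent with the description that follows:

```python
def model_builder(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))

    # Tune the number of units in the first densely connected layer
    # between 32 and 512.
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    model.add(tf.keras.layers.Dense(units=hp_units, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.2))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

    # Tune the learning rate of the Adam optimizer over three candidate values.
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    return model
```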
This code is essentially the same as what we used previously for training a model with a single set of hyperparameters, except that we also defined hp_units and hp_learning_rate objects that are used in our dense layer and optimizer.
The hp_units object instantiates an integer that will be tuned between 32 and 512 and used as the number of units in the first densely connected layer. The hp_learning_rate object will tune the learning rate for the adam optimizer that will be chosen from among these values: 0.01, 0.001, or 0.0001.
Once the model builder is defined, we can then instantiate our tuner. There are several tuning algorithms we can use (e.g., random search, Bayesian optimization, Hyperband). Here we use the Hyperband tuning algorithm. It uses adaptive resource allocation and early stopping to converge faster on a high-performing model.
Listing 8.16 The Hyperband model tuner
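A sketch of the Hyperband tuner; the directory and project_name values are illustrative:

```python
tuner = kt.Hyperband(
    model_builder,
    objective='val_accuracy',  # Use validation accuracy as the objective.
    max_epochs=10,             # Maximum number of epochs during tuning.
    directory='tuning',        # Where tuning results are stored (illustrative).
    project_name='mnist',
)
```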
We use the validation accuracy as the objective, and the maximum number of epochs is 10 during model tuning.
To reduce overfitting, we can create an EarlyStopping callback to stop training as soon as the model reaches a threshold for the validation loss. Make sure to reload the dataset into memory if you’ve started a new Python session.
Listing 8.17 The EarlyStopping callback
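A sketch of the callback; the patience value is an illustrative assumption:

```python
# Stop training early when the validation loss stops improving.
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
```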
Now we can start our hyperparameter search via tuner.search().
Listing 8.18 The hyperparameter search with early stopping
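A sketch of the search call; the validation split and per-trial epochs are illustrative values:

```python
tuner.search(
    x_train, y_train,
    validation_split=0.2,   # Hold out part of the training data for validation.
    epochs=10,
    callbacks=[stop_early],
)
```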
Once the search is complete, we can identify the optimal hyperparameters and train the model on the data for 30 epochs.
Listing 8.19 Obtaining the best hyperparameters and training the model
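A sketch of retrieving the best hyperparameters (using the assumed names 'units' and 'learning_rate' from the model builder above) and retraining for 30 epochs:

```python
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print(f"Best number of units: {best_hps.get('units')}")
print(f"Best learning rate: {best_hps.get('learning_rate')}")

# Build a fresh model with the best hyperparameters and train it for 30 epochs.
model = tuner.hypermodel.build(best_hps)
model.fit(x_train, y_train, validation_split=0.2, epochs=30)
```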
When we evaluate the model on our test data, we should see it’s more performant than our baseline model without hyperparameter tuning.
Listing 8.20 Model evaluation on the test data
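A one-line sketch, assuming the retrained model from the previous listing:

```python
model.evaluate(x_test, y_test)
```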
You’ve learned how to run TensorFlow locally on a single machine. To take the most advantage of TensorFlow, the model training process should be run in a distributed cluster, which is where Kubernetes comes into play. In the next section, I will introduce Kubernetes and provide hands-on examples of the fundamentals.
8.1.2 Exercises

1. Can you use the previously saved model directly for model evaluation?
2. Instead of using the Hyperband tuning algorithm, could you try the random search algorithm?
8.2 Kubernetes: The distributed container orchestration system

Kubernetes (also known as K8s) is an open source system for automating the deployment, scaling, and management of containerized applications. It abstracts away complex container management and provides declarative configurations to orchestrate containers in different computing environments.
Containers are grouped into logical units for a particular application for easy management and discovery. Kubernetes builds upon more than 16 years of experience running production workloads at Google, combined with best-in-class ideas and practices from the community. Its main design goal is to make it easy to deploy and manage complex distributed systems, while still benefiting from the improved utilization that containers enable. It’s open source, which gives the community the freedom to take advantage of on-premises, hybrid, or public cloud infrastructure and lets you effortlessly move workloads to where it matters.
Kubernetes is designed to scale without increasing your operations team. Figure 8.2 is an architecture diagram of Kubernetes and its components. However, we won’t be discussing those components because they are not the focus of this book. We will, however, use kubectl (on the left-hand side of the diagram), a command-line interface of Kubernetes, to interact with the Kubernetes cluster and obtain information that we are interested in.
Figure 8.2 An architecture diagram of Kubernetes
We will go through some basic concepts and examples to build our knowledge and prepare for the following sections on Kubeflow and Argo Workflows.
8.2.1 The basics

First, let’s set up a local Kubernetes cluster. We’ll use k3d (https://k3d.io) to bootstrap the local cluster. k3d is a lightweight wrapper to run k3s (a minimal Kubernetes distribution provided by Rancher Labs) in Docker. k3d makes it very easy to create either single-node or multinode k3s clusters in Docker for local development that requires a Kubernetes cluster. Let’s create a Kubernetes cluster called distml via k3d.
Listing 8.21 Creating a local Kubernetes cluster
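A sketch of the k3d command; pinning the k3s image is optional and the tag shown here is an assumption matching the version mentioned below:

```bash
# Create a single-node k3s cluster named distml.
k3d cluster create distml --image rancher/k3s:v1.25.3-k3s1
```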
We can get the list of nodes for the cluster we created via the following listing.
Listing 8.22 Obtaining the list of nodes in the cluster
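```bash
kubectl get nodes
```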
In this case, the node was created 1 minute ago, and we are running the v1.25.3+k3s1 version of the k3s distribution. The status is Ready, so we can proceed to the next steps.
We can also look at the node’s details via kubectl describe node k3d-distml-server-0. For example, the labels and system info contain information such as the operating system and its architecture and whether this node is a master node. The output also shows the node’s addresses, as well as its capacity, which indicates how much computational resource is available.
Then we’ll create a namespace called basics in this cluster for our project. Namespaces in Kubernetes provide a mechanism for isolating groups of resources within a single cluster (see http://mng.bz/BmN1). Names of resources need to be unique within a namespace but not across namespaces. The following examples will be in this single namespace.
Listing 8.23 Creating a new namespace
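```bash
kubectl create namespace basics
```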
Once the cluster and namespace are set up, we’ll use a convenient tool called kubectx to help us inspect and navigate between namespaces and clusters (https://github.com/ahmetb/kubectx). Note that this tool is not required for day-to-day work with Kubernetes, but it should make Kubernetes much easier to work with for developers. For example, we can obtain a list of clusters and namespaces that we can connect to.
Listing 8.24 Switching contexts and namespaces
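A sketch of listing the available contexts and namespaces:

```bash
kubectx   # List the available contexts (clusters) in your kubeconfig.
kubens    # List the available namespaces in the current cluster.
```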
For example, we can switch to the distml cluster via the k3d-distml context and the basics namespace that we just created using the following listing.
Listing 8.25 Activating the context and namespace
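A sketch of switching to the cluster and namespace described above; k3d prefixes the context name with k3d-:

```bash
kubectx k3d-distml
kubens basics
```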
Switching contexts and namespaces is often needed when working with multiple clusters and namespaces. We are using the basics namespace to run the examples in this chapter, but we will switch to another namespace dedicated to our project in the next chapter.
Next, we will create a Kubernetes Pod. Pods are the smallest deployable units of computing that you can create and manage in Kubernetes. A Pod may consist of one or more containers with shared storage and network resources and a specification for how to run the containers. A Pod’s contents are always co-located and co-scheduled and run in a shared context. The concept of the Pod models an application-specific “logical host,” meaning that it contains one or more application containers that are relatively tightly coupled. In noncloud contexts, applications executed on the same physical or virtual machine are analogous to cloud applications executed on the same logical host. In other words, a Pod is similar to a set of containers with shared namespaces and shared filesystem volumes.
The following listing provides an example of a Pod that consists of a container running the image whalesay to print out a “hello world” message. We save the following Pod spec in a file named hello-world.yaml.
Listing 8.26 An example Pod
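A sketch of such a Pod spec; the restart policy of Never is an assumption so the Pod reaches the Completed status mentioned below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: whalesay
spec:
  containers:
  - name: whalesay
    image: docker/whalesay:latest
    command: [cowsay]
    args: ["hello world"]
  restartPolicy: Never
```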
To create the Pod, run the following command.
Listing 8.27 Creating the example Pod in the cluster
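```bash
kubectl create -f hello-world.yaml
```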
We can then check whether the Pod has been created by retrieving the list of Pods. Note that pods is plural so we can get the full list of created Pods. We will use the singular form to get the details of this particular Pod later.
Listing 8.28 Getting the list of Pods in the cluster
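```bash
kubectl get pods
```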
The Pod status is Completed, so we can look at what was printed out by the whalesay container, as shown in the following listing.
Listing 8.29 Checking the Pod logs
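Assuming the Pod name whalesay from the sketch above:

```bash
kubectl logs whalesay
```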
We can also retrieve the raw YAML of the Pod via kubectl. Note that we use -o yaml here to get the plain YAML, but other formats, such as JSON, are also supported. We use the singular pod to get the details of this particular Pod instead of the full list of existing Pods, as mentioned earlier.
Listing 8.30 Getting the raw Pod YAML
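```bash
kubectl get pod whalesay -o yaml
```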
You may be surprised how much additional content, such as status and conditions, has been added to the original YAML we used to create the Pod. The additional information is appended and updated via the Kubernetes server so that client-side applications know the current status of the Pod. Even though we didn’t specify the namespace explicitly, the Pod was created in the basics namespace since we have used the kubens command to set the current namespace.
That’s it for the basics of Kubernetes! In the next section, we will study how to use Kubeflow to run distributed model training jobs in the local Kubernetes cluster we just set up.
8.2.2 Exercises

1. How do you get the Pod information in JSON format?
2. Can a Pod contain multiple containers?
8.3 Kubeflow: Machine learning workloads on Kubernetes

The Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable. The goal of Kubeflow is not to re-create other services but to provide a straightforward way to deploy best-in-class open source systems for machine learning to diverse infrastructures. Anywhere you run Kubernetes, you should be able to run Kubeflow. We will use Kubeflow to submit distributed machine learning model training jobs to a Kubernetes cluster.
Let’s first take a look at what components Kubeflow provides. Figure 8.3 is a diagram that consists of the main components.
Figure 8.3 Main components of Kubeflow
Kubeflow Pipelines (KFP; https://github.com/kubeflow/pipelines) provides a Python SDK to make machine learning pipelines easier. It is a platform for building and deploying portable and scalable machine learning workflows using Docker containers. The primary objectives of KFP are to enable the following:

- End-to-end orchestration of ML workflows
- Pipeline composability through reusable components and pipelines
- Easy management, tracking, and visualization of pipeline definitions, runs, experiments, and machine learning artifacts
- Efficient use of computing resources by eliminating redundant executions through caching
- Cross-platform pipeline portability through a platform-neutral IR YAML pipeline definition

KFP uses Argo Workflows as the backend workflow engine, which I will introduce in the next section, and we’ll use Argo Workflows directly instead of a higher-level wrapper like KFP. The ML Metadata project has been merged into KFP and serves as the backend for logging metadata produced in machine learning workflows written in KFP.
Next is Katib (https://github.com/kubeflow/katib). Katib is a Kubernetes-native project for automated machine learning. Katib supports hyperparameter tuning, early stopping, and neural architecture search. Katib is agnostic to machine learning frameworks. It can tune hyperparameters of applications written in any language of the users’ choice and natively supports many machine learning frameworks, such as TensorFlow, Apache MXNet, PyTorch, XGBoost, and others. Katib can perform training jobs using any Kubernetes custom resource with out-of-the-box support for Kubeflow Training Operator, Argo Workflows, Tekton Pipelines, and many more. Figure 8.4 is a screenshot of the Katib UI that performs experiment tracking.
Figure 8.4 A screenshot of the Katib UI that performs experiment tracking
Here we can visualize different training and validation accuracies for different sets of hyperparameters.
This provides a summary of the trials and highlights the best parameters.
KServe (https://github.com/kserve/kserve) was born as part of the Kubeflow project and was previously known as KFServing. KServe provides a Kubernetes custom resource definition (CRD) for serving machine learning models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high-abstraction interfaces for common ML frameworks. It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features like GPU autoscaling, scale to zero, and canary rollouts to your machine learning deployments. Figure 8.5 is a diagram that illustrates the position of KServe in the ecosystem.
Figure 8.5 KServe positioning in the ecosystem
Kubeflow provides a web UI. Figure 8.6 provides a screenshot of the UI. Users can access the models, pipelines, experiments, artifacts, etc., in each tab on the left side to facilitate the iterative process of the end-to-end machine learning model life cycle. The web UI is integrated with Jupyter Notebooks to be easily accessible. There are also SDKs in different languages to help users integrate with any internal systems. In addition, users can interact with all the Kubeflow components via kubectl since they are all native Kubernetes custom resources and controllers. The training operator (https://github.com/kubeflow/training-operator) provides Kubernetes custom resources that make it easy to run distributed or nondistributed TensorFlow, PyTorch, Apache MXNet, XGBoost, or MPI jobs on Kubernetes.

The Kubeflow project has accumulated more than 500 contributors and 20,000 GitHub stars. It’s heavily adopted by various companies and has more than 10 vendors, including Amazon AWS, Azure, Google Cloud, IBM, etc. Seven working groups maintain different subprojects independently. We will use the training operator to submit distributed model training jobs and KServe to build our model serving component. Once you complete the next chapter, I recommend trying out the other subprojects in the Kubeflow ecosystem on your own when needed. For example, if you’d like to tune the performance of the model, you can use Katib’s automated machine learning and hyperparameter tuning functionalities.
Figure 8.6 A screenshot of the Kubeflow UI
Users can access the models, pipelines, experiments, artifacts, etc., to facilitate the iterative process of the end-to-end machine learning model life cycle.
8.3.1 The basics

Next, we’ll take a closer look at the distributed training operator of Kubeflow and submit a distributed model training job that runs locally in the Kubernetes local cluster we created in the previous section. Let’s first create and activate a dedicated kubeflow namespace for our examples and reuse the existing cluster we created earlier.
Listing 8.31 Creating and switching to a new namespace
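```bash
kubectl create namespace kubeflow
kubens kubeflow
```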
Then, we must go back to our project folder and apply all the manifests to install all the tools we need.
Listing 8.32 Applying all manifests and installing all the tools
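A sketch only; it assumes the project’s bundled manifests live in a kustomize-based manifests folder, which is specific to the book’s companion repository:

```bash
# Apply everything under the manifests folder (path and layout are assumptions).
kubectl apply -k manifests
```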
Note that we’ve bundled all the necessary tools in this manifests folder:

- Kubeflow Training Operator, which we will use in this chapter for distributed model training
- Argo Workflows (https://github.com/argoproj/argo-workflows), which we address in chapter 9 when we discuss workflow orchestration and chain all the components together in a machine learning pipeline. We can ignore Argo Workflows for now.

As introduced earlier, the Kubeflow Training Operator provides Kubernetes custom resources that make it easy to run distributed or nondistributed jobs on Kubernetes, including TensorFlow, PyTorch, Apache MXNet, XGBoost, MPI jobs, etc.
Before we dive into Kubeflow, we need to understand what custom resources are. A custom resource is an extension of the Kubernetes API not necessarily available in a default Kubernetes installation. It is a customization of a particular Kubernetes installation. However, many core Kubernetes functions are now built using custom resources, making Kubernetes more modular (http://mng.bz/lWw2).
Custom resources can appear and disappear in a running cluster through dynamic registration, and cluster admins can update custom resources independently of the cluster. Once a custom resource is installed, users can create and access its objects using kubectl, just as they do for built-in resources like Pods. For example, the following listing defines the TFJob custom resource that allows us to instantiate and submit a distributed TensorFlow training job to the Kubernetes cluster.
Listing 8.33 TFJob CRD
All instantiated TFJob custom resource objects (tfjobs) will be handled by the training operator. The following listing provides the definition of the deployment of the training operator that runs a stateful controller to continuously monitor and process any submitted tfjobs.
Listing 8.34 Training operator deployment
With this abstraction, data science teams can focus on writing the Python code in TensorFlow that will be used as part of a TFJob specification and don’t have to manage the infrastructure themselves. For now, we can skip the low-level details and use TFJob to implement our distributed model training. Next, let’s define our TFJob in a file named tfjob.yaml.
Listing 8.35 An example TFJob definition
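A sketch of such a TFJob; the container image and command are illustrative assumptions (any image containing your distributed TensorFlow MNIST training script works), and generateName is used so the created Pods get a generated suffix like the ones shown below:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: distributed-tfjob-
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2              # Two worker replicas with the same container definition.
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow   # The training operator expects this container name.
            # Illustrative image running an MNIST image classification example.
            image: kubeflow/tf-mnist-with-summaries:latest
            command: ["python", "/var/tf_mnist/mnist_with_summaries.py"]
```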
In this spec, we are asking the controller to submit a distributed TensorFlow model training job with two worker replicas, where each worker replica follows the same container definition, running the MNIST image classification example.
Once it’s defined, we can submit it to our local Kubernetes cluster via the following listing.
Listing 8.36 Submitting TFJob
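```bash
# kubectl create (rather than apply) is needed if the sketch above
# uses generateName instead of a fixed name.
kubectl create -f tfjob.yaml
```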
We can see whether the TFJob has been submitted successfully by getting the TFJob list.
Listing 8.37 Getting the TFJob list
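```bash
kubectl get tfjobs
```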
When we get the list of Pods, we can see two worker Pods, distributed-tfjob-qc8fh-worker-1 and distributed-tfjob-qc8fh-worker-0, have been created and started running. The other Pods can be ignored since they are running the Kubeflow and Argo Workflows operators.
Listing 8.38 Getting the list of Pods
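```bash
kubectl get pods
```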
A machine learning system consists of many different components. We only used Kubeflow to submit distributed model training jobs, but it’s not connected to other components yet. In the next section, we’ll explore the basic functionalities of Argo Workflows to connect different steps in a single workflow so that they can be executed in a particular order.
8.3.2 Exercises

1. If your model training requires parameter servers, can you express that in a TFJob?
8.4 Argo Workflows: Container-native workflow engine

The Argo Project is a suite of open source tools for deploying and running applications and workloads on Kubernetes. It extends the Kubernetes APIs and unlocks new and powerful capabilities in application deployment, container orchestration, event automation, progressive delivery, and more. It consists of four core projects: Argo CD, Argo Rollouts, Argo Events, and Argo Workflows. Besides these core projects, many other ecosystem projects are based on, extend, or work well with Argo. A complete list of resources related to Argo can be found at https://github.com/terrytangyuan/awesome-argo.
Argo CD is a declarative, GitOps application delivery tool for Kubernetes. It manages application definitions, configurations, and environments declaratively in Git. The Argo CD user experience makes Kubernetes application deployment and life-cycle management automated, auditable, and easy to grasp. It comes with a UI so engineers can see what’s happening in their clusters and watch application deployments, etc. Figure 8.7 is a screenshot of the resource tree in the Argo CD UI.
Argo Rollouts is a Kubernetes controller and set of CRDs that provides progressive deployment capabilities. It introduces blue–green and canary deployments, canary analysis, experimentation, and progressive delivery features to your Kubernetes cluster.
Figure 8.7 A screenshot of the resources tree in Argo CD UI
Next is Argo Events. It’s an event-based dependency manager for Kubernetes. It can define multiple dependencies from various event sources like webhooks, Amazon S3, schedules, and streams and trigger Kubernetes objects after successful event dependencies resolution. A complete list of available event sources can be found in figure 8.8.
Finally, Argo Workflows is a container-native workflow engine for orchestrating parallel jobs, implemented as Kubernetes CRD. Users can define workflows where each step is a separate container, model multistep workflows as a sequence of tasks or capture the dependencies between tasks using a graph, and run compute-intensive jobs for machine learning or data processing. Users often use Argo Workflows together with Argo Events to trigger event-based workflows. The main use cases for Argo Workflows are machine learning pipelines, data processing, ETL (extract, transform, load), infrastructure automation, continuous delivery, and integration.
Argo Workflows also provides interfaces such as a command-line interface (CLI), server, UI, and SDKs for different languages. The CLI is useful for managing workflows and performing operations such as submitting, suspending, and deleting workflows through the command line. The server is used for integrating with other services. There are both REST and gRPC service interfaces. The UI is useful for managing and visualizing workflows and any artifacts/logs created by the workflows, as well as other useful information, such as resource usage analytics. We will walk through some examples of Argo Workflows to prepare for our project.
Figure 8.8 Available event sources in Argo Events
Before we look at some examples, let’s make sure we have the Argo Workflows UI at hand. It’s optional, since you can still follow these examples in the command line by interacting directly with Kubernetes via kubectl, but it’s nice to see the directed acyclic graph (DAG) visualizations in the UI as well as access additional functionalities. By default, the Argo Workflows UI service is not exposed to an external IP. To access the UI, use the method in the following listing.
Listing 8.39 Port-forwarding the Argo server
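A sketch of the port-forward; adjust the namespace to wherever the Argo server was installed by the manifests in your cluster:

```bash
kubectl port-forward deployment/argo-server 2746:2746 -n kubeflow
```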
Next, visit the following URL to access the UI: https://localhost:2746. Alternatively, you can expose a load balancer to get an external IP to access the Argo Workflows UI in your local cluster. Check out the official documentation for more details: https://argoproj.github.io/argo-workflows/argo-server/. Figure 8.9 is a screenshot of what the Argo Workflows UI looks like for a map-reduce–style workflow.
Figure 8.9 Argo Workflows UI illustrating a map-reduce–style workflow
The following listing is a basic “hello world” example of Argo Workflows. We can specify the container image and the command to run for this workflow and print out a “hello world” message.
Listing 8.40 “Hello world” example
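A sketch of the canonical “hello world” Workflow, which runs a single whalesay step:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-   # A random suffix is generated for each submission.
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
```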
Let’s go ahead and submit the workflow to our cluster.
Listing 8.41 Submitting the workflow
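Assuming the Workflow spec above was saved to a file (the file name is illustrative):

```bash
kubectl create -f hello-world-workflow.yaml
```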
We can then check whether it was submitted successfully and has started running.
Listing 8.42 Getting the list of workflows
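```bash
kubectl get workflows
```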
Once the workflow status has changed to Succeeded, we can check the statuses of the Pods created by the workflow. First, let’s find all the Pods associated with the workflow. We can use a label selector to get the list of Pods.
Listing 8.43 Getting the list of Pods belonging to this workflow
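A sketch using the label that Argo Workflows attaches to every Pod it creates; replace the placeholder with the generated workflow name:

```bash
kubectl get pods -l workflows.argoproj.io/workflow=<workflow-name>
```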
Once we know the Pod name, we can get the logs of that Pod.
Listing 8.44 Checking the Pod logs
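```bash
# "main" is the container Argo Workflows uses to run the step.
kubectl logs <pod-name> -c main
```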
As expected, we get the same logs as the ones we had with the simple Kubernetes Pod in the previous sections since this workflow only runs one “hello world” step.
The next example uses a resource template where you can specify a Kubernetes custom resource that will be submitted by the workflow to the Kubernetes cluster.
Here we create a Kubernetes config map named cm-example with a simple key-value pair. The config map is a Kubernetes-native object to store key-value pairs.
Listing 8.45 Resource template
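A sketch of a resource template that creates the cm-example config map; the key-value pair is illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: k8s-resource-
spec:
  entrypoint: k8s-resource
  templates:
  - name: k8s-resource
    resource:
      action: create          # Create the embedded manifest in the cluster.
      manifest: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: cm-example
        data:
          some: value
```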
This example is perhaps most useful to Python users. You can write a Python script as part of the template definition. We can generate some random numbers using the built-in random Python module. Alternatively, you can specify the execution logic of the script inside a container template without writing inline Python code, as seen in the “hello world” example.
Listing 8.46 Script template
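A sketch of a script template that generates a random integer with the built-in random module:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: script-tmpl-
spec:
  entrypoint: gen-random-int
  templates:
  - name: gen-random-int
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        i = random.randint(1, 100)
        print(i)
```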
Let’s submit it.
Listing 8.47 Submitting the script template workflow
Now, let’s check its logs to see whether a random number was generated.
Listing 8.48 Checking the Pod logs
So far, we’ve only seen examples of single-step workflows. Argo Workflows also allows users to define the workflow as a DAG by specifying the dependencies of each task. The DAG can be simpler to maintain for complex workflows and allows maximum parallelism when running tasks.
Let’s look at an example of a diamond-shaped DAG created by Argo Workflows. This DAG consists of four steps (A, B, C, and D), and each has its own dependencies. For example, step C depends on step A, and step D depends on steps B and C.
Listing 8.49 A diamond example using DAG
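A sketch of the diamond-shaped DAG described above, where each step reuses a shared echo template:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
```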
Let’s submit it.
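Listing 8.50 Submitting the DAG workflow

A one-line sketch, assuming the DAG spec above was saved as dag-diamond.yaml:

```bash
kubectl create -f dag-diamond.yaml
```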
When the workflow is completed, we will see four Pods for each of the steps where each step prints out its step name—A, B, C, and D.
Listing 8.51 Getting the list of Pods belonging to this workflow
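```bash
kubectl get pods -l workflows.argoproj.io/workflow=<workflow-name>
```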
The visualization of the DAG is available in the Argo Workflows UI. It’s usually more intuitive to see how the workflow is executed in a diamond-shaped flow in the UI, as seen in figure 8.10.
Figure 8.10 A screenshot of a diamond-shaped workflow in the UI
Next, we will look at a simple coin-flip example to showcase the conditional syntax provided by Argo Workflows. We can specify a condition to indicate whether we want to run the next step. For example, we run the flip-coin step first, which is the Python script we saw earlier, and if the result returns heads, we run the template called heads, which prints another log saying it was heads. Otherwise, we print that it was tails. We can specify these conditionals inside the when clause in the different steps.
Listing 8.52 Coin-flip example
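A sketch of the coin-flip workflow, combining a script template with when conditions on the follow-up steps:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: coinflip-
spec:
  entrypoint: coinflip
  templates:
  - name: coinflip
    steps:
    - - name: flip-coin
        template: flip-coin
    - - name: heads
        template: heads
        when: "{{steps.flip-coin.outputs.result}} == heads"
      - name: tails
        template: tails
        when: "{{steps.flip-coin.outputs.result}} == tails"
  - name: flip-coin
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        result = "heads" if random.randint(0, 1) == 0 else "tails"
        print(result)
  - name: heads
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"it was heads\""]
  - name: tails
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"it was tails\""]
```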
Let’s submit the workflow.
Listing 8.53 Submitting the coin-flip example
Figure 8.11 is a screenshot of what this flip-coin workflow looks like in the UI.
Figure 8.11 Screenshot of the flip-coin workflow in the UI
When we get the list of Pods for this workflow, we find only two Pods.
Listing 8.54 Getting the list of Pods belonging to this workflow
We can check the logs of the flip-coin step to confirm that it printed tails, since the next step executed was the tails step.
That’s a wrap! We’ve just learned the basic syntax of Argo Workflows, which should cover all the prerequisites for the next chapter! In the next chapter, we will use Argo Workflows to implement the end-to-end machine learning workflow that consists of the actual system components introduced in chapter 7.
8.4.2 Exercises

1. Besides accessing the output of each step like {{steps.flip-coin.outputs.result}}, what are other available variables?
2. Can you trigger workflows automatically by Git commits or other events?
8.5 Answers to exercises

Section 8.1

1. Yes, via model = tf.keras.models.load_model('my_model.h5'); model.evaluate(x_test, y_test).
2. You should be able to do it easily by changing the tuner to kt.RandomSearch(model_builder).
Section 8.2

1. kubectl get pod -o json.
2. Yes, you can define additional containers in pod.spec.containers in addition to the existing single container.
Section 8.3

1. Similar to worker replicas, define PS (parameter server) replicas in your TFJob spec to specify the number of parameter servers.
Section 8.4

1. The complete list is available here: http://mng.bz/d1Do.
2. Yes, you can use Argo Events to watch Git events and trigger workflows.
Summary

- We used TensorFlow to train a machine learning model on the MNIST dataset on a single machine.
- We learned the basic concepts in Kubernetes and gained hands-on experience by implementing them in a local Kubernetes cluster.
- We submitted distributed model training jobs to Kubernetes via Kubeflow.
- We learned about different types of templates and how to define either DAGs or sequential steps using Argo Workflows.