Recap of KubeCon AI_dev Hong Kong 2024

Posted by Hao Liang's Blog on Friday, August 30, 2024

1. Introduction to KubeCon

The Cloud Native Computing Foundation’s flagship conference joined with Open Source Summit and AI_dev to gather adopters and technologists from leading open source and cloud native communities in Hong Kong from 21-23 August 2024. KubeCon is held every year in Europe, North America, and Asia, and has been held in China since 2018. This was my third time attending KubeCon; the first two times I attended as a speaker.

2. Topic

At this conference, I personally paid the most attention to topics related to AI, GPUs, and scheduling, which are closest to my own work. Here are a few talks I focused on and discussed in depth with the speakers.

Is Your GPU Really Working Efficiently in the Data Center? N Ways to Improve GPU Usage - Xiao Zhang, DaoCloud & Wu Ying Jun, China Mobile

link

This talk focuses on the GPU utilization, performance, and efficiency issues encountered when running AI workloads in a K8s cluster, and explores how to improve GPU MFU, communication performance, and training efficiency. Specific optimizations include the following:

  • Communication performance:
    • Optimize the parallel strategy: in multi-parallelism training, the communication volume ranks TP (Tensor Parallel) > DP (Data Parallel) > PP (Pipeline Parallel). When dividing parallel groups, place ranks from the same TP group on the same node, ranks from the same DP group under the same leaf switch (layer one), and ranks from the same PP group under the same spine switch (layer two) to achieve the best performance (see the sketch after this list).
    • Switch topology-aware scheduling: make the scheduler aware of the switch topology so this placement can actually be enforced, improving communication performance.
  • Training efficiency:
    • Checkpoint optimization: write checkpoints to shared memory first, reducing the time that checkpointing interrupts training from minutes to seconds.
    • Faulty node isolation and automatic rescheduling for recovery
  • GPU Utilization:
    • GPU sharing: adopt the CNCF sandbox open source project HAMi.
    • Priority-based preemption: let high- and low-priority tasks preempt compute resources from each other.
    • Elastic quota: temporarily lease idle compute capacity between tenants.
    • GPU memory over-commitment: over-allocate GPU memory and swap part of it to host memory when needed.
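
To make the parallel-group placement rule concrete, here is a minimal Python sketch (my own illustration, not the speakers’ code; the cluster shape and parallel sizes are assumptions) that maps global ranks to parallel groups and to nodes, leaf switches, and spine switches, so that TP traffic stays intra-node, DP traffic stays under one leaf, and PP traffic stays under one spine:

```python
# Assumed cluster shape: 8 GPUs per node, 2 nodes per leaf switch, 2 leaves per spine switch.
GPUS_PER_NODE = 8
NODES_PER_LEAF = 2
LEAVES_PER_SPINE = 2

# Assumed parallel sizes: TP matches the GPUs per node so a TP group never leaves a node.
TP, DP, PP = 8, 2, 2  # 32 ranks in total

def coords(rank):
    """Map a global rank to (tp, dp, pp) with TP varying fastest (Megatron-style ordering)."""
    tp = rank % TP
    dp = (rank // TP) % DP
    pp = rank // (TP * DP)
    return tp, dp, pp

def location(rank):
    """Map a global rank to (node, leaf, spine) by filling nodes in rank order.
    With this ordering, each TP group lands on one node, each DP group stays
    under one leaf switch, and each PP group stays under one spine switch."""
    node = rank // GPUS_PER_NODE
    leaf = node // NODES_PER_LEAF
    spine = leaf // LEAVES_PER_SPINE
    return node, leaf, spine

if __name__ == "__main__":
    for rank in range(TP * DP * PP):
        tp, dp, pp = coords(rank)
        node, leaf, spine = location(rank)
        print(f"rank {rank:2d}: tp={tp} dp={dp} pp={pp} -> node {node}, leaf {leaf}, spine {spine}")
```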

The overall approach is quite comprehensive, covering scheduling, communication, and GPU sharing. As vendors continue to scale up GPU training, I believe this work will be a major step in optimizing large-scale GPU training in the future.

Sit Back and Relax with Fault Awareness and Robust Instant Recovery for Large Scale AI Workloads - Fanshi Zhang & Kebe Liu, DaoCloud

link

This talk discusses how the open source project kcover quickly locates faults in GPU hardware, RDMA networks, and GPU training/inference frameworks in K8s clusters, and how it automatically detects, isolates, and recovers faulty nodes to improve the stability and resilience of AI workloads.

The implementation ideas behind the project:

  • Collect GPU, PCIe, NCCL, and PyTorch error logs from nodes and containers, match them against known keywords to identify and classify fault types, and flag faulty nodes (a minimal sketch of this idea follows the list)
  • Evict and reschedule containers on failed nodes so that tasks recover automatically
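
Here is a toy Python sketch of the log-keyword classification idea (my own illustration, not kcover’s actual code; the keyword table and sample log lines are assumptions):

```python
import re

# Hypothetical keyword table mapping fault types to log patterns; a real detector
# would need a much richer rule set and per-component log sources.
FAULT_PATTERNS = {
    "gpu_xid_error": re.compile(r"NVRM: Xid"),
    "gpu_ecc_error": re.compile(r"uncorrectable ECC error", re.IGNORECASE),
    "nccl_error": re.compile(r"NCCL.*(timed out|timeout|unhandled system error)", re.IGNORECASE),
    "cuda_oom": re.compile(r"CUDA out of memory", re.IGNORECASE),
}

def classify_line(line):
    """Return the first fault type whose pattern matches the log line, or None."""
    for fault_type, pattern in FAULT_PATTERNS.items():
        if pattern.search(line):
            return fault_type
    return None

def scan_logs(lines):
    """Count fault types seen in a stream of log lines (per node or container)."""
    counts = {}
    for line in lines:
        fault = classify_line(line)
        if fault is not None:
            counts[fault] = counts.get(fault, 0) + 1
    return counts

if __name__ == "__main__":
    sample = [
        "NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.",
        "torch.cuda.OutOfMemoryError: CUDA out of memory.",
        "NCCL WARN Call to connect returned Connection timed out",
    ]
    # A node whose counts exceed some threshold would then be flagged, cordoned, and drained.
    print(scan_logs(sample))
```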

Failures of all kinds are inevitable in large-scale GPU training. Meta’s recent Llama 3 paper also mentioned that training was interrupted hundreds of times by various failures, which significantly hurt training efficiency. However, from the kcover project itself and from conversations with its maintainers, we also found that the project has certain limitations:

  • Identifying failures by matching log keywords covers only a limited set of error types, and it often cannot locate the root cause.
  • It cannot explain why a task hangs, or find a straggler node whose slow communication or computation drags down the overall training speed, because these problems leave no clear error messages.

All in all, kcover is a good source of inspiration. Building a cloud-native, unified platform for intelligent fault diagnosis and self-healing of GPU workloads (unified log collection, analysis, isolation, and recovery) will be a key way to handle faults in large-scale GPU training in the future.

Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training - William Wang, Huawei

link

This talk introduces how to improve the RDMA network communication performance of multi-node training tasks by implementing inter-node and intra-node topology-aware scheduling for GPU workloads in a K8s cluster.

Its main features:

  • Implement RDMA network topology-aware scheduling and rescheduling of GPU tasks (inter-node and intra-node communication)
    • Inter-node communication topology optimization: in a spine-leaf network, avoid crossing spine switches during scheduling
    • Intra-node communication topology optimization: NVLink and NUMA affinity scheduling
  • Build resource and workload models (a sketch of the grouping idea follows this list)
    • Resource model (HyperNode): group nodes by network performance (nodes with the same performance tier are grouped together)
    • Workload model (HyperJob): group tasks by performance requirements and match them to node groups with the corresponding performance
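
As a rough illustration of the grouping idea (my own sketch, not Volcano’s implementation; the topology labels, group names, and fallback order are assumptions), a scheduler could group candidate nodes by their leaf/spine switches and prefer placements that keep the whole job under one leaf:

```python
from collections import defaultdict

# Hypothetical node topology, e.g. derived from node labels such as
# "network.topology/leaf" and "network.topology/spine" (label keys are assumptions).
NODES = {
    "node-1": {"leaf": "leaf-a", "spine": "spine-1"},
    "node-2": {"leaf": "leaf-a", "spine": "spine-1"},
    "node-3": {"leaf": "leaf-b", "spine": "spine-1"},
    "node-4": {"leaf": "leaf-c", "spine": "spine-2"},
}

def group_nodes(nodes, level):
    """Group nodes by a topology level ('leaf' or 'spine') -- the HyperNode-style idea."""
    groups = defaultdict(list)
    for name, topo in nodes.items():
        groups[topo[level]].append(name)
    return groups

def pick_placement(nodes, replicas):
    """Prefer a single leaf group; fall back to a single spine group; otherwise give up."""
    for level in ("leaf", "spine"):
        for group, members in group_nodes(nodes, level).items():
            if len(members) >= replicas:
                return level, group, members[:replicas]
    return None

if __name__ == "__main__":
    # A 2-replica job fits under one leaf; a 3-replica job falls back to a single spine.
    print(pick_placement(NODES, 2))  # ('leaf', 'leaf-a', ['node-1', 'node-2'])
    print(pick_placement(NODES, 3))  # ('spine', 'spine-1', ['node-1', 'node-2', 'node-3'])
```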

The entire solution is currently implemented in Huawei Cloud’s internal version of Volcano and will gradually be open sourced into the community version of Volcano. It remains to be seen how general the solution is.

Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi - Xiao Zhang, DaoCloud & Mengxuan Li, The 4th Paradigm

link

The talk introduced the details of the GPU sharing project HAMi. The project virtualizes one physical GPU into multiple virtual GPUs, enabling a finer-grained division of compute and GPU memory and reducing the cost of GPU usage for users.

Briefly, the implementation principle of the project’s GPU sharing solution is as follows:

  • Use user-mode CUDA API interception to hook core APIs such as kernel launch and GPU memory allocation, so that compute and GPU memory usage can be limited (similar to the vCUDA solution open sourced by Tencent in 2018); a toy analogy of the interception idea is sketched after this list
  • In addition to GPU sharing on NVIDIA cards, HAMi also supports sharing heterogeneous accelerators such as Huawei Ascend and Cambricon at the framework level (i.e., in the scheduler and device-plugin ecosystem; the core GPU sharing interception library is provided by the hardware vendors)
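
HAMi’s real interception is done in a C library loaded ahead of the CUDA runtime, but as a toy analogy (entirely my own sketch in Python; the quota value and function names are made up), the idea of enforcing a per-container GPU memory quota by wrapping allocation calls looks like this:

```python
import functools

# Assumed per-container quota; the real system reads its limits from pod specs/annotations.
GPU_MEMORY_QUOTA = 4 * 1024**3  # 4 GiB
_allocated = 0

class QuotaExceeded(RuntimeError):
    pass

def intercept_alloc(alloc_fn):
    """Wrap an allocation function so every call is checked against the quota,
    analogous to hooking a driver allocation API in a user-mode interception library."""
    @functools.wraps(alloc_fn)
    def wrapper(num_bytes):
        global _allocated
        if _allocated + num_bytes > GPU_MEMORY_QUOTA:
            raise QuotaExceeded(f"request of {num_bytes} bytes exceeds the quota")
        _allocated += num_bytes
        return alloc_fn(num_bytes)
    return wrapper

@intercept_alloc
def fake_device_alloc(num_bytes):
    """Stand-in for the real driver allocation call."""
    return object()  # pretend this is a device pointer

if __name__ == "__main__":
    fake_device_alloc(1 * 1024**3)      # fine: 1 GiB of the 4 GiB quota
    try:
        fake_device_alloc(8 * 1024**3)  # rejected: would exceed the quota
    except QuotaExceeded as e:
        print("blocked:", e)
```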

Finally, I had a long, in-depth exchange with the project’s maintainer and found that there are still several areas worth optimizing and improving:

  • The current compute-partitioning solution has weak isolation and lags behind actual usage (it throttles compute based on the GPU utilization observed in the previous time window).
  • In multi-GPU tasks with periodic communication synchronization, the throttling of each card is not synchronized, which slows down the entire training job.
  • The GPU memory over-commitment implementation is too simplistic: GPU memory is allocated through Unified Memory, and the NVIDIA driver decides which GPU memory gets swapped to host memory. This only suits tidal online-offline scenarios (where the offline task is idle at the time) and cannot handle GPU memory demands in some special cases (for example, GPU memory fragmentation, or two tasks needing more GPU memory at the same time, which degrades performance).

Empower Large Language Models (LLMs) Serving in Production with Cloud Native AI Technologies - Lize Cai, SAP & Yang Che, Alibaba Cloud Intelligence

link

This talk is about KServe, a very popular cloud-native AI inference project in the community, and introduces how the project manages the lifecycle of LLM inference.

The talk mainly walks through KServe’s features, so I will only mention them briefly here without going into details; if you are interested, you can read the official KServe documentation directly. KServe essentially abstracts the various pieces involved in GPU inference at the K8s CRD level (for example: model, network, service instance), and a minimal InferenceService example follows the list:

  • InferenceService: user-oriented, defining the model format, model name, model location, and required resources of the inference service
  • ClusterServingRuntime: administrator-oriented, defining the model formats, protocols, and resources supported by the cluster
  • TrainedModel: defines model attributes, size, and storage location
  • InferenceGraph: defines an inference workflow graph
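
For a concrete example, here is a minimal sketch (my own, using the sklearn-iris sample model from the KServe documentation and the kubernetes Python client; verify the field names against the KServe v1beta1 docs for your version) that creates an InferenceService:

```python
from kubernetes import client, config

# Minimal InferenceService manifest; the storageUri points at the sample
# sklearn model from the KServe documentation.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "default"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "gs://kfserving-examples/models/sklearn/1.0/model",
                "resources": {"limits": {"cpu": "1", "memory": "2Gi"}},
            }
        }
    },
}

if __name__ == "__main__":
    config.load_kube_config()  # assumes a kubeconfig pointing at a cluster with KServe installed
    api = client.CustomObjectsApi()
    api.create_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace="default",
        plural="inferenceservices",
        body=inference_service,
    )
    print("InferenceService sklearn-iris created")
```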

With these abstractions, deploying a model inference service maps more closely to the user’s actual scenarios, and managing model services becomes more efficient. Using KServe to deploy model inference services on K8s is also a very typical example of combining AI and K8s, and I believe more open source projects will bring richer AI scenarios to K8s in the future.

3. Other Interesting Events

Ambassador Breakfast

I had breakfast with a few CNCF Ambassadors (Ms. Fog and Whitney, my former colleagues Huabing Zhao, Hoon Jo from Korea, and Xiaohui Zhang from Flomesh) and we talked about many cultural differences in work and life across different countries. We usually interact very actively in the online open source community, so I was very happy to meet them offline this time.

Ambassador & Maintainer Dinner with CNCF staff

In the evening, CNCF staff (Jeff and Jorge) warmly invited us community maintainers and ambassadors to dinner. We chatted all night, and the topics ranged from “how to host a KubeCon”, “how to review KubeCon topics”, and “what it is like to work at CNCF” to “various gossip about people we know well in the community”. There was constant laughter. Jeff and Jorge kept the dinner lively, and we even chatted about the F1 topic that interests me the most (everyone complained about Max Verstappen winning the championship again this year).

Met with Linus Torvalds

This was the most unexpected and meaningful experience at KubeCon. I was about to head back to the hotel after KubeCon when I suddenly noticed a familiar figure downstairs at the venue. When I got closer, I saw that it was indeed Linus. At that moment, no one else had noticed him, so I quickly went up, chatted with him for a bit, and took a precious selfie. Linus feels completely different in person than online: he smiled warmly the whole time and took photos and chatted with everyone. We talked about his schedule for the next few days and his thoughts on his keynote session (haha, he said taking part in it felt a bit like a forced business trip, but he hadn’t traveled in a long time and it was nice to go somewhere far away once in a while). I hadn’t realized that Linus is still working for the Linux Foundation after all these years and has not retired. Not long after, everyone noticed Linus was there and swarmed around him, so I hurriedly said goodbye and left.

I never expected to be able to talk and take photos with the legendary Linus in this life. It feels like a dream.

Finally, many thanks to the Linux Foundation and CNCF for bringing me so many wonderful and unforgettable experiences. It really made my day. Can’t wait to see you all next year!