Hao Liang's Blog

Embrace the World with Cloud Native and Open-source

Customize Kubernetes cpuset strategy in GPU container task

1. Brief description

This article uses the NVIDIA L40S GPU device as an example to briefly describe its GPU topology and the current CPU core-binding capabilities of Kubernetes. To briefly review, the NVIDIA L40S GPU device topology is shown in the figure below. The CPU of this model has a total of 384 cores, distributed across 2 NUMA nodes. Each GPU has a Mellanox high-speed RDMA network card attached under the same PCIe bridge.
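As a rough illustration of the core counts mentioned above, the sketch below splits 384 cores across 2 NUMA nodes and renders each node's cores in cpuset list format. The contiguous core numbering and the helper names (`numa_core_ranges`, `to_cpuset`) are assumptions made for illustration; real machines often interleave core IDs across nodes, so the actual layout should be checked on the host.

```python
from typing import Iterable, List


def numa_core_ranges(total_cores: int, numa_nodes: int) -> List[range]:
    """Split core IDs evenly across NUMA nodes.

    Assumption: core IDs are numbered contiguously per node,
    which is not guaranteed on real hardware.
    """
    if total_cores % numa_nodes != 0:
        raise ValueError("cores must divide evenly across NUMA nodes")
    per_node = total_cores // numa_nodes
    return [range(n * per_node, (n + 1) * per_node) for n in range(numa_nodes)]


def to_cpuset(cores: Iterable[int]) -> str:
    """Render core IDs in cpuset list format, collapsing consecutive runs."""
    ids = sorted(set(cores))
    parts = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the current run while the next ID is consecutive.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


if __name__ == "__main__":
    for node, cores in enumerate(numa_core_ranges(384, 2)):
        print(f"NUMA node {node}: {to_cpuset(cores)}")
```

A container pinned to the cores of one node would then use that node's string as its cpuset, under the same contiguous-numbering assumption.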

Posted by Hao Liang's Blog on Sunday, December 8, 2024

Kubelet Streaming Server Port Closed Unexpectedly

1. Description

Kernel version: 5.4.241
kubelet version: 1.22.5
NVIDIA driver version: 535.161.08 and 535.154.05

After the kubelet process on the node starts, it listens on a random port (46127) within the ip_local_port_range range, as shown by ss -lntpe | grep kubelet. After running for a while, the listening port suddenly disappears. The corresponding file descriptor (fd=13) is also closed, but the kubelet process is still running.

2. Analysis

The corresponding kubelet code shows that the streaming server is started in a separate goroutine.

Posted by Hao Liang's Blog on Saturday, July 13, 2024

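Kubelet Streaming Server Port Closed Unexpectedly (Chinese version)

1. Description

Kernel version: 5.4.241
kubelet version: 1.22.5
NVIDIA driver version: 535.161.08 and 535.154.05

After the kubelet process on the node starts, it listens on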

Posted by Hao Liang's Blog on Saturday, July 13, 2024

Renaming Node Name without Resetting kubelet Environment

Goal

Rename any node in a Kubernetes cluster.
No need to reset the whole kubelet environment, as most approaches require.
No need to drain the Pods running on the node.

Bootstrap Process of kubelet

Reference: https://kubernetes.io/docs/reference/access-authn-authz/kubelet-tls-bootstrapping/
Introduction in Chinese: https://cloud.tencent.com/developer/article/1656007

The kubelet process starts. It looks for the kubeconfig file specified by the --kubeconfig=xxx argument; if that file is not found, it looks for the bootstrap kubeconfig file specified by the --bootstrap-kubeconfig=xxx argument instead.
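The lookup order in that first bootstrap step can be sketched as follows. This is a minimal illustration of the fallback logic only, not kubelet's actual code; the function name `pick_kubelet_credentials` and its return labels are hypothetical.

```python
import os
from typing import Optional, Tuple


def pick_kubelet_credentials(
    kubeconfig: Optional[str],
    bootstrap_kubeconfig: Optional[str],
) -> Tuple[str, str]:
    """Return which credentials file would be used, preferring kubeconfig.

    Hypothetical helper: mirrors the described order of
    --kubeconfig first, then --bootstrap-kubeconfig.
    """
    if kubeconfig and os.path.isfile(kubeconfig):
        return ("kubeconfig", kubeconfig)
    if bootstrap_kubeconfig and os.path.isfile(bootstrap_kubeconfig):
        return ("bootstrap-kubeconfig", bootstrap_kubeconfig)
    raise FileNotFoundError("neither kubeconfig nor bootstrap-kubeconfig exists")
```

In this sketch the bootstrap file is consulted only when the kubeconfig file is missing, which is the fallback the step describes.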

Posted by Hao Liang's Blog on Friday, December 22, 2023

【Kubelet】Practical analysis of Kubernetes node extension resources and Device Plugin

1. Background

In Kubernetes, a node is abstracted as a set of resources. Currently, there are five officially defined attributes for a node's allocatable resources: cpu, memory, ephemeral-storage, hugepages-1Gi, and hugepages-2Mi. When we create a Pod, the scheduler checks whether the Pod's requests (required resources) fit within the allocatable resources of a node to decide whether the Pod can run on that node. In many business scenarios, these attributes cannot fully describe the resources of a node (such as GPU, network card bandwidth, number of allocatable IPs, etc.

Posted by Hao Liang's Blog on Sunday, November 7, 2021
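The fit check described in the Device Plugin post's background can be sketched roughly as below. This is a simplified illustration, not the scheduler's real implementation: the function name `pod_fits_node`, the `in_use` parameter, and the use of plain integers (for example millicores and bytes) are assumptions.

```python
from typing import Dict, Optional


def pod_fits_node(
    requests: Dict[str, int],
    allocatable: Dict[str, int],
    in_use: Optional[Dict[str, int]] = None,
) -> bool:
    """Return True if every requested resource fits in what the node has free.

    Simplified sketch: quantities are plain integers, and a resource the
    node does not advertise is treated as having zero capacity.
    """
    in_use = in_use or {}
    for name, wanted in requests.items():
        free = allocatable.get(name, 0) - in_use.get(name, 0)
        if wanted > free:
            return False
    return True
```

Because an unadvertised resource counts as zero here, a Pod requesting something the node does not expose is rejected in this sketch, which is why a node has to advertise additional resources before Pods that request them can be placed on it.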