Hao Liang's Blog

Embrace the World with Cloud Native and Open-source

Differences in container runtime between different versions of docker

Reference articles: K8s will eventually abandon docker, and TKE already supports containerd; Using docker as an image building service in a containerd cluster. 1. Background When comparing Kubernetes clusters that use different docker versions (1.18, 1.19) as the container runtime, we found some differences in the underlying implementation, which I record here. 2. Issue analysis docker 1.18 container process tree: containerd is not a system service, but a process started by dockerd
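A quick way to see this difference on a node is to inspect the process hierarchy directly; a minimal sketch, assuming pstree and systemd are available on the node:

# Is containerd running as its own systemd service?
$ systemctl status containerd

# Show the process tree under dockerd; when containerd is started by dockerd
# it appears here as a child process rather than as a separate service
$ pstree -p $(pidof dockerd)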

【ETCD】Analysis of the underlying mechanism of ETCD Defrag

1. Related source code server/storage/backend/backend.go#defrag() server/storage/backend/backend.go#defragdb() 2. Why do we need defrag When we use K8s clusters daily, if we frequently add or delete cluster data, we will notice a strange phenomenon: even though the amount of object data in the cluster has not increased significantly, the disk space occupied by the etcd data files keeps growing. Looking into this, we find that etcd officially recommends using the defrag command of the provided etcdctl tool to defragment the data of each etcd node:
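A minimal sketch of that workflow (the endpoint and any TLS flags are illustrative and should be adjusted to your cluster):

# Check the current database size of each member before defragmenting
$ etcdctl --endpoints=https://127.0.0.1:2379 endpoint status -w table

# Defragment one node at a time to release the fragmented space back to the filesystem
$ etcdctl --endpoints=https://127.0.0.1:2379 defrag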

【Troubleshooting】A large number of pending high-priority Pods in the cluster affect the scheduling of low-priority Pods

1. Background Related issues: low priority pods stuck in pending without any scheduling events #106546; Totally avoid Pod starvation (HOL blocking) or clarify the user expectation on the wiki #86373. Related optimization proposal: Efficient requeueing of Unschedulable Pods. 2. Issue analysis There are a large number of high-priority Pods in the Pending state in the cluster because the current cluster resources do not meet the resource requests of these high-priority

【Scheduling】Co-Scheduling Grouped Batch Pod Scheduler Plugin

1. Background Related proposal: KEP: Coscheduling based on PodGroup CRD. Source code address: Coscheduling. In some scenarios (batch workloads such as Spark jobs and TensorFlow jobs), a group of Pods must all be scheduled successfully at the same time before they can run normally; if only some of the Pods in the group are scheduled, even those Pods cannot run normally. The current Kubernetes native scheduler cannot ensure that scheduling starts only after the whole group of Pods has been created.
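For illustration, a minimal sketch of the PodGroup idea based on the scheduler-plugins coscheduling design (the exact apiVersion and pod label key depend on the plugin version):

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: spark-job-pg
  namespace: batch
spec:
  minMember: 3    # Pods are only bound once all 3 members of the group can be placed

Pods belonging to the group then carry a label (for example pod-group.scheduling.sigs.k8s.io: spark-job-pg) so the plugin can associate them with this PodGroup.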

【Kubelet】Practical analysis of Kubernetes node extension resources and Device Plugin

1. Background In Kubernetes, a node is abstracted as a set of resources. Currently, there are five officially defined resource types in a node's allocatable resources: cpu, memory, ephemeral-storage, hugepages-1Gi, hugepages-2Mi. When we create a Pod, the scheduler checks whether the Pod's requests (required resources) fit within the allocatable resources of the current node, and thus whether the Pod can run on this node. In many business scenarios, these built-in types cannot fully describe the resource attributes of a node (such as GPU, network card bandwidth, number of allocable IPs, etc.
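As an illustration, once a node advertises such an extended resource, a Pod can request it like any built-in resource; a minimal sketch, where example.com/gpu is a hypothetical resource name:

apiVersion: v1
kind: Pod
metadata:
  name: extended-resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        example.com/gpu: "1"    # extended resources must be requested in whole integers
      limits:
        example.com/gpu: "1"    # limits must equal requests for extended resources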

【Scheduling】Pod State Scheduling Scheduler plugin for scoring based on node pod status

This is the first scheduler plugin I contributed to the scheduler-plugins open source project of the Kubernetes sig-scheduling group, back in 2020. 1. Background Related PR: PR103: Pod State Scheduling Plugin. Source code address: Pod State Scheduling. The current Kubernetes native scheduler scoring algorithm (Score) considers neither the Terminating Pods nor the Nominated Pods already present on a node.

【Troubleshooting】Troubleshooting a pod that remains in the nominated state after scheduling failure

1. Problem description The pod test-pod-hgfmk in the test-ns namespace is in the Pending state, and its nominated node is 132.10.134.193:

$ kubectl get po -n test-ns test-pod-hgfmk -owide
NAME             READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE
test-pod-hgfmk   0/1     Pending   0          5m    <none>   <none>   132.10.134.193

Describing the pod shows an event indicating that scheduling failed due to insufficient cpu and memory resources, but monitoring found that there were many

【Scheduling】Capacity-scheduling Flexible Capacity Quota Scheduler Plugin

Recently, the 2021 North American KubeCon was held online. @denkensk and @yuanchen8911, active contributors to the scheduler-plugins open source project of the Kubernetes sig-scheduling group, gave a talk on the Capacity Scheduling elastic capacity quota scheduler plugin. 1. Background Related proposal: KEP9: Capacity scheduling. Source code address: Capacity scheduling. The current Kubernetes native ResourceQuota mechanism is limited to a single namespace (a ResourceQuota can only be configured per namespace). When Pod preemption occurs during scheduling, only the priority of its PriorityClass is used as the criterion for whether it can be preempted.
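For context, the plugin introduces an ElasticQuota object with a guaranteed min and a borrowable max per namespace; a minimal sketch based on the scheduler-plugins capacity scheduling design (the exact apiVersion may differ between plugin versions):

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-team-a
  namespace: team-a
spec:
  min:              # resources guaranteed to this namespace
    cpu: "10"
    memory: 20Gi
  max:              # upper bound the namespace may borrow up to when the cluster has idle resources
    cpu: "20"
    memory: 40Gi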

【Scheduling】Load-aware Load-awareness scheduler plugin

1. Background Related proposal: KEP61: Real Load Aware Scheduling. Source code address: Trimaran: Real Load Aware Scheduling. The current Kubernetes native scheduling logic, based on Pod requests and node allocatable resources, cannot truly reflect the real load of cluster nodes, so this scheduler plugin brings the real load of nodes into the Pod scheduling logic. The core component of this plugin, Load Watcher, comes from an open source project by PayPal.

【Troubleshooting】Analysis of a kube-scheduler scheduling failure problem

1. Background Recently, business Pod scheduling failures have often occurred online. Looking at the cluster monitoring, the resources of the cluster are indeed relatively tight, but there are still some nodes with sufficient resources. For example, the request values of the business Pod are set to:

resources:
  limits:
    cpu: "36"
    memory: 100Gi
  requests:
    cpu: "18"
    memory: 10Gi

There are nodes with idle resources in the cluster, yet the Pod event reports that there are not enough resources to schedule: 2.