Hao Liang's Blog

Embrace the World with Cloud Native and Open-source

【Troubleshooting】Reusable CPUs from initContainer were not being honored

1. Description This issue was observed in early versions of Kubernetes (v1.18). Related commit: "Fix a bug whereby reusable CPUs and devices were not being honored" #93289. Related PRs: "Fix a bug whereby reusable CPUs and devices were not being honored" #93189; "Refactor the algorithm used to decide CPU assignments in the CPUManager" #102014. Previously, it was possible for reusable CPUs and reusable devices (i.e. those previously consumed by init containers) not to be reused by subsequent init containers or app containers if the TopologyManager was enabled.
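To make the scenario concrete, here is a minimal sketch (pod and container names are hypothetical) of the kind of pod that hits this code path: with the kubelet's static CPU manager policy, the exclusive CPUs granted to the Guaranteed-QoS init container should be freed and reused by the app container once the init container exits, but with the TopologyManager enabled they could be left stranded instead.

```yaml
# Hypothetical pod for illustration. Both containers request integer
# CPUs with requests == limits, so the pod is Guaranteed QoS and the
# static CPU manager policy grants them exclusive CPUs. The CPUs held
# by the init container should become reusable by the app container.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-reuse-demo
spec:
  initContainers:
  - name: init
    image: busybox
    command: ["sh", "-c", "echo init done"]
    resources:
      requests:
        cpu: "2"
        memory: 100Mi
      limits:
        cpu: "2"
        memory: 100Mi
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        cpu: "2"
        memory: 100Mi
      limits:
        cpu: "2"
        memory: 100Mi
```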

【Troubleshooting】A large number of pending high-priority Pods in the cluster affect the scheduling of low-priority Pods

1. Background Related issues: "low priority pods stuck in pending without any scheduling events" #106546; "Totally avoid Pod starvation (HOL blocking) or clarify the user expectation on the wiki" #86373. Related optimization proposal: "Efficient requeueing of Unschedulable Pods". 2. Issue analysis There are a large number of high-priority Pods in the Pending state in the cluster because the current cluster resources do not meet the resource requests of these high-priority …
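For context, priority ordering comes from PriorityClass objects like the sketch below (names and values are illustrative): pods of the higher class sort ahead of lower-priority pods in the scheduling queue, so a large backlog of unschedulable high-priority pods can delay the low-priority pods queued behind them.

```yaml
# Illustrative PriorityClasses; names and values are hypothetical.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
description: "Queued and scheduled ahead of lower-priority pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
description: "Queued behind pending high-priority pods."
```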

【Troubleshooting】Troubleshooting a pod that remains in the nominated state after scheduling failure

1. Problem description The Pod test-pod-hgfmk in the test-ns namespace is in the Pending state, and its nominated node is 132.10.134.193:

```console
$ kubectl get po -n test-ns test-pod-hgfmk -owide
NAME             READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE
test-pod-hgfmk   0/1     Pending   0          5m    <none>   <none>   132.10.134.193
```

Describing the Pod shows events indicating that scheduling failed due to insufficient CPU and memory resources, but monitoring found that there were many …
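As a quick cross-check (a sketch reusing the pod and namespace names above), the nominated node is also recorded directly in the Pod status and can be read without parsing the table output:

```console
# Read the nominatedNodeName field from the Pod status.
$ kubectl get po -n test-ns test-pod-hgfmk -o jsonpath='{.status.nominatedNodeName}'
132.10.134.193
```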

【Troubleshooting】Analysis of a kube-scheduler scheduling failure problem

1. Background Recently, business Pod scheduling failures have been occurring frequently in production. Cluster monitoring shows that cluster resources are indeed relatively tight, but some nodes still have sufficient resources. For example, the business Pod's resource requests are set to:

```yaml
resources:
  limits:
    cpu: "36"
    memory: 100Gi
  requests:
    cpu: "18"
    memory: 10Gi
```

There are nodes with idle resources in the cluster, yet the Pod events report that there are not enough resources to schedule. 2. …
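A useful first step when this happens (the node name below is a placeholder) is to compare a seemingly idle node's allocatable resources and already-committed requests against the Pod's requests, since the scheduler filters on requests rather than actual usage:

```console
# Allocatable capacity on the node (what the scheduler can hand out).
$ kubectl describe node <node-name> | grep -A 6 "Allocatable:"
# Requests already committed on the node; the scheduler compares
# allocatable minus these against the new Pod's requests.
$ kubectl describe node <node-name> | grep -A 8 "Allocated resources:"
```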