Hao Liang's Blog

Embrace the World with Cloud Native and Open-source

【Troubleshooting】A large number of pending high-priority Pods in the cluster affect the scheduling of low-priority Pods

1. Background. Related issues: "low priority pods stuck in pending without any scheduling events" (#106546) and "Totally avoid Pod starvation (HOL blocking) or clarify the user expectation on the wiki" (#86373). Related optimization proposal: "Efficient requeueing of Unschedulable Pods". 2. Issue analysis. The cluster has a large number of high-priority Pods in the Pending state because the cluster's current resources cannot satisfy the resource requests of these high-priority

Posted by Hao Liang's Blog on Saturday, December 4, 2021

【Operating System】Go Runtime's MADV_FREE memory release issue

1. Background. Related issues: "runtime: memory not being returned to OS" (#22439) and "runtime: provide way to disable MADV_FREE". Applications compiled with Go 1.12 through 1.15 often show resident memory (RSS) that keeps growing over time after startup and is never released. 2. Issue analysis. Use pprof to analyze the memory usage reported by the Go runtime. The memory categories in pprof have the following meanings:
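The symptom described in this excerpt (RSS that grows and is never released) can be cross-checked against the runtime's own accounting. The following is a minimal sketch, not taken from the post: it reads a few `runtime.MemStats` fields before and after `debug.FreeOSMemory()`. The helper names `heapStats`, `readHeapStats`, and `churn`, and the 64 MiB allocation size, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapStats holds the runtime.MemStats fields most relevant to RSS growth.
// (Hypothetical helper type, introduced only for this sketch.)
type heapStats struct {
	HeapInuse    uint64 // bytes in in-use spans
	HeapIdle     uint64 // bytes in idle (unused) spans
	HeapReleased uint64 // bytes of physical memory returned to the OS
	HeapSys      uint64 // bytes of heap memory obtained from the OS
}

// readHeapStats takes a snapshot of the current heap statistics.
func readHeapStats() heapStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapStats{
		HeapInuse:    m.HeapInuse,
		HeapIdle:     m.HeapIdle,
		HeapReleased: m.HeapReleased,
		HeapSys:      m.HeapSys,
	}
}

// churn allocates n bytes, writes to every byte so the pages are backed by
// physical memory, and returns the buffer length. The buffer becomes garbage
// as soon as the function returns.
func churn(n int) int {
	buf := make([]byte, n)
	for i := range buf {
		buf[i] = 1
	}
	return len(buf)
}

func main() {
	churn(64 << 20) // 64 MiB, an arbitrary size chosen for the sketch

	before := readHeapStats()
	// FreeOSMemory forces a garbage collection and then tries to return
	// as much memory to the operating system as possible.
	debug.FreeOSMemory()
	after := readHeapStats()

	fmt.Printf("before: inuse=%d idle=%d released=%d sys=%d\n",
		before.HeapInuse, before.HeapIdle, before.HeapReleased, before.HeapSys)
	fmt.Printf("after:  inuse=%d idle=%d released=%d sys=%d\n",
		after.HeapInuse, after.HeapIdle, after.HeapReleased, after.HeapSys)
}
```

`HeapIdle - HeapReleased` estimates memory the runtime is holding but could return. On Linux with `MADV_FREE`, pages the runtime counts as released can still appear in RSS until the kernel reclaims them under memory pressure, which is one way the two views can disagree. Running the program with `GODEBUG=madvdontneed=1` makes the runtime use `MADV_DONTNEED` instead, so comparing RSS between the two runs is a reasonable first check.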

Posted by Hao Liang's Blog on Saturday, November 27, 2021

【Troubleshooting】Troubleshooting a pod that remains in the nominated state after scheduling failure

1. Problem description. The pod test-pod-hgfmk in the test-ns namespace is in the Pending state, and its nominated node is 132.10.134.193:

    $ kubectl get po -n test-ns test-pod-hgfmk -owide
    NAME             READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE
    test-pod-hgfmk   0/1     Pending   0          5m    <none>   <none>   132.10.134.193

The events shown by describing the pod indicate that scheduling failed due to insufficient CPU and memory resources, but monitoring found that there were many

Posted by Hao Liang's Blog on Sunday, October 24, 2021

【Troubleshooting】Analysis of a kube-scheduler scheduling failure problem

1. Background. Recently, business Pods have frequently failed to schedule in production. Cluster monitoring shows that resources are indeed relatively tight, but some nodes still have sufficient resources. For example, the resources of the business Pod are set to:

    resources:
      limits:
        cpu: "36"
        memory: 100Gi
      requests:
        cpu: "18"
        memory: 10Gi

There are nodes with idle resources in the cluster: Pod events reported that there are not enough resources to schedule: 2.

Posted by Hao Liang's Blog on Friday, September 10, 2021