【Troubleshooting】A large number of pending high-priority Pods in the cluster affect the scheduling of low-priority Pods

Posted by Hao Liang's Blog on Saturday, December 4, 2021

1. Background

Related issues:

low priority pods stuck in pending without any scheduling events #106546

Totally avoid Pod starvation (HOL blocking) or clarify the user expectation on the wiki #86373

Related optimization proposal: Efficient requeueing of Unschedulable Pods

2. Issue analysis

There are a large number of high-priority Pods in the Pending state in the cluster because the current cluster resources do not meet the resource requests of these high-priority Pods.

The PriorityClasses in the cluster are as follows:

$ kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high                      0            true             110d
low                       -1000        false            110d
medium                    -500         false            110d

A large number of high-priority Pods are Pending. At this point, a medium-priority Pod is created without any requests or limits, so in theory it should be scheduled successfully. However, it also stays Pending and produces no Events at all, which indicates that the scheduler never takes it out of the active queue for a scheduling attempt.
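
The symptom can be reproduced with a Pod that references the medium PriorityClass and declares no requests or limits. Below is a minimal reproduction sketch using client-go; the namespace, Pod name, and image are assumptions made for illustration:

// Create a Pod with the "medium" PriorityClass and no requests/limits,
// then observe whether it ever leaves the Pending phase.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "medium-priority-test"},
		Spec: v1.PodSpec{
			PriorityClassName: "medium", // no resource requests/limits on purpose
			Containers: []v1.Container{
				{Name: "app", Image: "nginx"},
			},
		},
	}
	created, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("created %s, phase=%s\n", created.Name, created.Status.Phase)
}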

3. Cause

Before version 1.21, when a Pod failed to be scheduled it entered the Backoff Queue, and then all such Pods re-entered the Active Queue:

// pkg/scheduler/eventhandlers.go
// addAllEventHandlers is a helper function used in tests and in Scheduler
// to add event handlers for various informers.
func addAllEventHandlers(
	sched *Scheduler,
	informerFactory informers.SharedInformerFactory,
	podInformer coreinformers.PodInformer,
) {
	// scheduled pod cache
	...
	// unscheduled pod queue
	podInformer.Informer().AddEventHandler(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch t := obj.(type) {
				case *v1.Pod:
					// All pods that failed to be scheduled are re-enqueued
					return !assignedPod(t) && responsibleForPod(t, sched.Profiles)
				case cache.DeletedFinalStateUnknown:
					if pod, ok := t.Obj.(*v1.Pod); ok {
						return !assignedPod(pod) && responsibleForPod(pod, sched.Profiles)
					}
					utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
					return false
				default:
					utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sched.addPodToSchedulingQueue,
				UpdateFunc: sched.updatePodInSchedulingQueue,
				DeleteFunc: sched.deletePodFromSchedulingQueue,
			},
		},
	)

}

We know that the Active Queue is implemented as a priority queue that uses the Pod's Priority Class as the weight. The relevant code is as follows:

Initialize the scheduler:

// pkg/scheduler/factory.go
// create a scheduler from a set of registered plugins.
func (c *Configurator) create() (*Scheduler, error) {
	...
	
	// Profiles are required to have equivalent queue sort plugins.
	// The specific sorting of the scheduling queue is implemented by the QueueSort plug-in
	lessFn := profiles[c.profiles[0].SchedulerName].Framework.QueueSortFunc()
	podQueue := internalqueue.NewSchedulingQueue(
		lessFn,
		internalqueue.WithPodInitialBackoffDuration(time.Duration(c.podInitialBackoffSeconds)*time.Second),
		internalqueue.WithPodMaxBackoffDuration(time.Duration(c.podMaxBackoffSeconds)*time.Second),
	)

   ...

	return &Scheduler{
		SchedulerCache:  c.schedulerCache,
		Algorithm:       algo,
		Profiles:        profiles,
		NextPod:         internalqueue.MakeNextPodFunc(podQueue),
		Error:           MakeDefaultErrorFunc(c.client, podQueue, c.schedulerCache),
		StopEverything:  c.StopEverything,
		VolumeBinder:    c.volumeBinder,
		SchedulingQueue: podQueue,
	}, nil
}

Create a scheduler priority queue:

// pkg/scheduler/internal/queue/scheduling_queue.go
// NewSchedulingQueue initializes a priority queue as a new scheduling queue.
func NewSchedulingQueue(lessFn framework.LessFunc, opts ...Option) SchedulingQueue {
    return NewPriorityQueue(lessFn, opts...)
}

The following is the implementation of the QueueSort plugin:

// pkg/scheduler/framework/plugins/queuesort/priority_sort.go
// Less is the function used by the activeQ heap algorithm to sort pods.
// It sorts pods based on their priority. When priorities are equal, it uses
// PodInfo.timestamp.
func (pl *PrioritySort) Less(pInfo1, pInfo2 *framework.PodInfo) bool {
	p1 := pod.GetPodPriority(pInfo1.Pod)
	p2 := pod.GetPodPriority(pInfo2.Pod)
	// 1. Pods with a higher Priority Class are placed at the front of the queue.
	// 2. When priorities are equal, the Pod created earlier is placed first.
	return (p1 > p2) || (p1 == p2 && pInfo1.Timestamp.Before(pInfo2.Timestamp))
}

Therefore, when a large number of high-priority Pods are Pending because scheduling keeps failing, they are all re-enqueued for another attempt. Because of their higher priority, after re-enqueueing they are placed ahead of the low-priority Pods (effectively jumping the queue). As scheduling keeps failing, they keep re-entering the queue ahead of the low-priority Pods, so the low-priority Pods never get dequeued and never enter a scheduling cycle.
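
To make the queue jumping concrete, here is a minimal, self-contained sketch that reuses the same ordering rule as PrioritySort.Less with Go's container/heap. It is not the scheduler's actual queue code; the podInfo type and Pod names are made up for illustration:

// A toy activeQ ordered by the same rule as PrioritySort.Less.
package main

import (
	"container/heap"
	"fmt"
	"time"
)

type podInfo struct {
	name      string
	priority  int32
	timestamp time.Time
}

type activeQ []*podInfo

func (q activeQ) Len() int { return len(q) }
func (q activeQ) Less(i, j int) bool {
	// Higher priority first; an earlier enqueue timestamp breaks ties.
	p1, p2 := q[i].priority, q[j].priority
	return p1 > p2 || (p1 == p2 && q[i].timestamp.Before(q[j].timestamp))
}
func (q activeQ) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *activeQ) Push(x interface{}) { *q = append(*q, x.(*podInfo)) }
func (q *activeQ) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &activeQ{}
	heap.Init(q)

	// One medium-priority Pod enqueued first, then several unschedulable
	// high-priority Pods that keep being re-enqueued with fresh timestamps.
	heap.Push(q, &podInfo{name: "medium-pod", priority: -500, timestamp: time.Now()})
	for i := 0; i < 3; i++ {
		heap.Push(q, &podInfo{name: fmt.Sprintf("high-pod-%d", i), priority: 0, timestamp: time.Now()})
	}

	// The medium-priority Pod is only popped after every high-priority Pod,
	// so as long as high-priority Pods keep re-entering, it never gets a turn.
	for q.Len() > 0 {
		fmt.Println(heap.Pop(q).(*podInfo).name)
	}
}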

4. Solution

Temporary solution: raise the new Pod's Priority Class to high, equal to the priority of most of the Pending Pods. Because the priorities are now equal, ordering falls back to the queue timestamp: the Pods that are re-enqueued after each failed attempt get a later timestamp and line up behind the new Pod, so the new Pod can be successfully dequeued and scheduled.
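
A small standalone sketch of this tie-break (again with made-up names): with equal priorities, the Pod holding the earlier queue timestamp sorts first, while Pods re-enqueued after a failed attempt carry a later timestamp:

// Demonstrates the timestamp tie-break used when priorities are equal.
package main

import (
	"fmt"
	"sort"
	"time"
)

type queuedPod struct {
	name      string
	priority  int32
	timestamp time.Time
}

func main() {
	base := time.Now()
	pods := []queuedPod{
		{"requeued-high-pod-a", 0, base.Add(2 * time.Second)}, // re-enqueued after a failed attempt
		{"formerly-medium-pod", 0, base},                      // raised to "high", keeps its earlier timestamp
		{"requeued-high-pod-b", 0, base.Add(3 * time.Second)},
	}
	// Same ordering rule as PrioritySort.Less.
	sort.Slice(pods, func(i, j int) bool {
		p1, p2 := pods[i].priority, pods[j].priority
		return p1 > p2 || (p1 == p2 && pods[i].timestamp.Before(pods[j].timestamp))
	})
	fmt.Println("next pod to be scheduled:", pods[0].name) // formerly-medium-pod
}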

Complete solution: in scheduler versions after 1.21, the re-enqueue logic after a scheduling failure is optimized. Failed Pods are no longer unconditionally re-enqueued (if the cluster's remaining resources have not changed, a Pod that just failed to schedule will fail again, so re-enqueueing it is pointless). Instead, the scheduler watches for resource changes in the cluster that can affect Pod scheduling and uses those changes to trigger re-enqueueing of the corresponding failed Pods.

For example, when a new node joins the cluster, the Pods that were Pending are re-enqueued and scheduling is retried, because a new node means the cluster's resources have increased and re-enqueueing the Pending Pods is now worthwhile. For the detailed design, please refer to: Efficient requeueing of Unschedulable Pods
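
The following is a heavily simplified, illustrative sketch of that idea, not the actual kube-scheduler implementation: unschedulable Pods are parked in a separate set and only moved back to the active queue when a cluster event that could make them schedulable arrives. All type and event names here are invented for illustration:

// Event-driven requeueing of unschedulable Pods (conceptual sketch only).
package main

import "fmt"

type clusterEvent string

const (
	nodeAdded  clusterEvent = "NodeAdded"
	podDeleted clusterEvent = "PodDeleted"
)

type schedulingQueue struct {
	activeQ       []string            // pods waiting for a scheduling attempt
	unschedulable map[string]struct{} // pods parked after a failed attempt
}

func (q *schedulingQueue) markUnschedulable(pod string) {
	q.unschedulable[pod] = struct{}{}
}

// moveOnEvent re-activates parked pods only when an event arrives that may
// have freed up resources, instead of re-enqueueing them unconditionally.
func (q *schedulingQueue) moveOnEvent(ev clusterEvent) {
	switch ev {
	case nodeAdded, podDeleted:
		for pod := range q.unschedulable {
			q.activeQ = append(q.activeQ, pod)
			delete(q.unschedulable, pod)
		}
	default:
		// events that cannot change scheduling results are ignored
	}
}

func main() {
	q := &schedulingQueue{unschedulable: map[string]struct{}{}}
	q.markUnschedulable("high-pod-0")
	q.markUnschedulable("high-pod-1")

	q.moveOnEvent("ConfigMapUpdated") // irrelevant event: nothing re-enqueued
	fmt.Println("after irrelevant event:", q.activeQ)

	q.moveOnEvent(nodeAdded) // new node joined: parked pods get another attempt
	fmt.Println("after NodeAdded:", q.activeQ)
}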