[Troubleshooting] A Large Number of Pending High-Priority Pods Blocks Scheduling of Low-Priority Pods

1. Background

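2. Symptom Analysis

The cluster has a large number of high-priority Pods stuck in the Pending state, because the cluster's current resources cannot satisfy the resource requests of these Pods.

The Pod Priority Classes in the cluster are as follows: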

$ kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high                      0            true             110d
low                       -1000        false            110d
medium                    -500         false            110d

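Many Pods with the high Priority Class are Pending. When a Pod with the medium Priority Class is created at this point (with no requests or limits set, so in theory it should schedule successfully), it stays Pending indefinitely and produces no Events at all. This indicates that the scheduler never pops it from the Active Queue to attempt scheduling.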

3. Cause

Before version 1.21, a Pod that fails scheduling enters the Backoff Queue, and then all of these Pods are re-added to the Active Queue:

// pkg/scheduler/eventhandlers.go
// addAllEventHandlers is a helper function used in tests and in Scheduler
// to add event handlers for various informers.
func addAllEventHandlers(
	sched *Scheduler,
	informerFactory informers.SharedInformerFactory,
	podInformer coreinformers.PodInformer,
) {
	// scheduled pod cache
	...
	// unscheduled pod queue
	podInformer.Informer().AddEventHandler(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch t := obj.(type) {
				case *v1.Pod:
					// every pod that failed scheduling (still unassigned) is requeued
					return !assignedPod(t) && responsibleForPod(t, sched.Profiles)
				case cache.DeletedFinalStateUnknown:
					if pod, ok := t.Obj.(*v1.Pod); ok {
						return !assignedPod(pod) && responsibleForPod(pod, sched.Profiles)
					}
					utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
					return false
				default:
					utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sched.addPodToSchedulingQueue,
				UpdateFunc: sched.updatePodInSchedulingQueue,
				DeleteFunc: sched.deletePodFromSchedulingQueue,
			},
		},
	)

}

The Active Queue is implemented as a priority queue that uses the Pod Priority Class as its weight. The relevant code is shown below.

Scheduler initialization:

// pkg/scheduler/factory.go
// create a scheduler from a set of registered plugins.
func (c *Configurator) create() (*Scheduler, error) {
	...
	
	// Profiles are required to have equivalent queue sort plugins.
	// the actual ordering of the scheduling queue is implemented by the QueueSort plugin
	lessFn := profiles[c.profiles[0].SchedulerName].Framework.QueueSortFunc()
	podQueue := internalqueue.NewSchedulingQueue(
		lessFn,
		internalqueue.WithPodInitialBackoffDuration(time.Duration(c.podInitialBackoffSeconds)*time.Second),
		internalqueue.WithPodMaxBackoffDuration(time.Duration(c.podMaxBackoffSeconds)*time.Second),
	)

	...

	return &Scheduler{
		SchedulerCache:  c.schedulerCache,
		Algorithm:       algo,
		Profiles:        profiles,
		NextPod:         internalqueue.MakeNextPodFunc(podQueue),
		Error:           MakeDefaultErrorFunc(c.client, podQueue, c.schedulerCache),
		StopEverything:  c.StopEverything,
		VolumeBinder:    c.volumeBinder,
		SchedulingQueue: podQueue,
	}, nil
}

Creating the scheduler's priority queue:

// pkg/scheduler/internal/queue/scheduling_queue.go
// NewSchedulingQueue initializes a priority queue as a new scheduling queue.
func NewSchedulingQueue(lessFn framework.LessFunc, opts ...Option) SchedulingQueue {
	return NewPriorityQueue(lessFn, opts...)
}

The QueueSort plugin itself is implemented as follows:

// pkg/scheduler/framework/plugins/queuesort/priority_sort.go
// Less is the function used by the activeQ heap algorithm to sort pods.
// It sorts pods based on their priority. When priorities are equal, it uses
// PodInfo.timestamp.
func (pl *PrioritySort) Less(pInfo1, pInfo2 *framework.PodInfo) bool {
	p1 := pod.GetPodPriority(pInfo1.Pod)
	p2 := pod.GetPodPriority(pInfo2.Pod)
	// 1. Pods with a higher Priority Class sort toward the front of the queue.
	// 2. When priorities are equal, the Pod with the earlier timestamp sorts first.
	return (p1 > p2) || (p1 == p2 && pInfo1.Timestamp.Before(pInfo2.Timestamp))
}

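As a result, when the cluster has many high-priority Pods that are Pending after failed scheduling attempts, all of them are requeued for another attempt. Because of their higher priority, they land ahead of the lower-priority Pods again (jumping the queue). They keep failing and keep cutting in front, so the lower-priority Pods are never dequeued and never enter the scheduling cycle.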
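
The starvation described above can be reproduced in a small, self-contained simulation. This is only a sketch under simplifying assumptions, not the scheduler's code: `queuedPod`, `activeQ`, and `simulate` are hypothetical names, the backoff delay is ignored, and a logical counter stands in for the real timestamp. The comparison mirrors the rule in `PrioritySort.Less` (higher priority first, then earlier timestamp).

```go
package main

import (
	"container/heap"
	"fmt"
)

// queuedPod is a hypothetical, simplified stand-in for framework.PodInfo.
type queuedPod struct {
	name        string
	priority    int32
	schedulable bool // assumption: whether a scheduling attempt would succeed
	timestamp   int  // logical clock value set when the pod enters the queue
}

// activeQ orders pods like PrioritySort.Less: higher priority first,
// then earlier timestamp.
type activeQ []*queuedPod

func (q activeQ) Len() int { return len(q) }
func (q activeQ) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority
	}
	return q[i].timestamp < q[j].timestamp
}
func (q activeQ) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *activeQ) Push(x interface{}) { *q = append(*q, x.(*queuedPod)) }
func (q *activeQ) Pop() interface{} {
	old := *q
	n := len(old)
	p := old[n-1]
	*q = old[:n-1]
	return p
}

// simulate runs up to `cycles` scheduling attempts and returns the name of the
// pod popped in each one. Unschedulable pods are requeued immediately with a
// fresh timestamp; schedulable pods leave the queue.
func simulate(pods []queuedPod, cycles int) []string {
	q := &activeQ{}
	clock := 0
	for i := range pods {
		p := pods[i] // copy so the caller's slice is not modified
		p.timestamp = clock
		clock++
		heap.Push(q, &p)
	}
	var popped []string
	for i := 0; i < cycles && q.Len() > 0; i++ {
		p := heap.Pop(q).(*queuedPod)
		popped = append(popped, p.name)
		if !p.schedulable {
			p.timestamp = clock
			clock++
			heap.Push(q, p)
		}
	}
	return popped
}

func main() {
	pods := []queuedPod{
		{name: "high-a", priority: 0},
		{name: "high-b", priority: 0},
		{name: "medium", priority: -500, schedulable: true},
	}
	// The medium pod never appears, although it alone could be scheduled.
	fmt.Println(simulate(pods, 6)) // [high-a high-b high-a high-b high-a high-b]
}
```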

4. Solution

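Temporary workaround: raise the Pod's Priority Class to high so that it matches the priority of most Pending Pods. Requeued Pods then sort toward the tail of the queue because their timestamps are later, and the older Pod can be dequeued and scheduled.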
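
The tie-break behind this workaround can be illustrated with a short ordering check. It is a sketch, not scheduler code: `pod`, `less`, and `dequeueOrder` are hypothetical names, and `queuedAt` is an assumed logical timestamp in which the two high-priority Pods were already requeued (t=3 and t=4) after the application Pod entered the queue (t=2).

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, simplified queue entry.
type pod struct {
	name     string
	priority int32
	queuedAt int // assumed logical time at which the pod (re)entered the queue
}

// less applies the same rule as PrioritySort.Less:
// higher priority first, then earlier timestamp.
func less(a, b pod) bool {
	return a.priority > b.priority ||
		(a.priority == b.priority && a.queuedAt < b.queuedAt)
}

// dequeueOrder returns pod names in the order the queue would hand them out.
func dequeueOrder(pods []pod) []string {
	sorted := append([]pod(nil), pods...) // sort a copy
	sort.Slice(sorted, func(i, j int) bool { return less(sorted[i], sorted[j]) })
	names := make([]string, 0, len(sorted))
	for _, p := range sorted {
		names = append(names, p.name)
	}
	return names
}

func main() {
	before := []pod{{"high-a", 0, 3}, {"high-b", 0, 4}, {"app", -500, 2}}
	after := []pod{{"high-a", 0, 3}, {"high-b", 0, 4}, {"app", 0, 2}}

	fmt.Println(dequeueOrder(before)) // [high-a high-b app]
	fmt.Println(dequeueOrder(after))  // [app high-a high-b]
}
```

With equal priorities the earlier timestamp decides, so the application Pod is handed out before the requeued Pods.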

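Permanent fix: scheduler versions after 1.21 optimize how Pods re-enter the queue after a scheduling failure. Not every Pod is requeued, because when the cluster's remaining resources have not changed, a Pod that failed scheduling will simply fail again, so requeueing it is pointless. Instead, the scheduler watches for changes to the cluster resources that affect Pod scheduling and requeues only the failed Pods that those changes could help.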

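For example, a Pending Pod is requeued for another attempt only when a new node joins the cluster, because a new node means more cluster resources, and only then is a retry worthwhile. For details of the optimization, see: Efficient requeueing of Unschedulable Pods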
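
The idea of event-driven requeueing can be sketched as follows. This is an illustration of the concept only and does not reproduce the scheduler's implementation: `clusterEvent`, `waitingPod`, `requeuer`, and the event names are all hypothetical. Failed Pods are parked together with the events that could make them schedulable, and only a matching event moves them back to the active queue.

```go
package main

import "fmt"

// clusterEvent is a hypothetical label for a change observed in the cluster.
type clusterEvent string

const (
	nodeAdd    clusterEvent = "NodeAdd"
	podDelete  clusterEvent = "AssignedPodDelete"
	serviceAdd clusterEvent = "ServiceAdd"
)

// waitingPod is a parked pod plus the events that justify retrying it.
type waitingPod struct {
	name   string
	wakeOn map[clusterEvent]bool
}

// requeuer keeps failed pods out of the active queue until a relevant event.
type requeuer struct {
	active        []string
	unschedulable []waitingPod
}

// markUnschedulable parks a pod and records which events may help it.
func (r *requeuer) markUnschedulable(name string, events ...clusterEvent) {
	wake := make(map[clusterEvent]bool, len(events))
	for _, e := range events {
		wake[e] = true
	}
	r.unschedulable = append(r.unschedulable, waitingPod{name: name, wakeOn: wake})
}

// onEvent moves only the pods that registered for ev back to the active queue
// and returns their names; all other pods stay parked.
func (r *requeuer) onEvent(ev clusterEvent) []string {
	var moved []string
	var still []waitingPod
	for _, p := range r.unschedulable {
		if p.wakeOn[ev] {
			moved = append(moved, p.name)
		} else {
			still = append(still, p)
		}
	}
	r.unschedulable = still
	r.active = append(r.active, moved...)
	return moved
}

func main() {
	r := &requeuer{}
	r.markUnschedulable("high-a", nodeAdd)
	r.markUnschedulable("high-b", nodeAdd, podDelete)

	fmt.Println(r.onEvent(serviceAdd)) // [] (irrelevant event, nothing is requeued)
	fmt.Println(r.onEvent(podDelete))  // [high-b]
	fmt.Println(r.onEvent(nodeAdd))    // [high-a]
}
```

Because parked Pods no longer compete for the head of the active queue, a lower-priority Pod that can actually be scheduled is not starved by them.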
