【Troubleshooting】Reusable CPUs from initContainer were not being honored

Posted by Hao Liang's Blog on Monday, October 23, 2023

1. Description

This issue exists in early versions of Kubernetes (around v1.18).

Related Commit: Fix a bug whereby reusable CPUs and devices were not being honored #93289

Related PRs: Fix a bug whereby reusable CPUs and devices were not being honored #93189; Refactor the algorithm used to decide CPU assignments in the CPUManager #102014

Previously, it was possible for reusable CPUs and reusable devices (i.e. those previously consumed by init containers) to not be reused by subsequent init containers or app containers if the TopologyManager was enabled. This would happen because hint generation for the TopologyManager was not considering the reusable devices when it made its hint calculation.

As such, it would sometimes:

  1. Generate a hint for a different NUMA node, causing the CPUs and devices to be allocated from that node instead of the one where the reusable devices live; or
  2. End up thinking there were not enough CPUs or devices to allocate and throw a TopologyAffinity admission error
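
To make the failure mode concrete, here is a small self-contained Go sketch that replays the hint calculation on a toy two-NUMA-node machine, once ignoring the reusable CPUs and once counting them. This is not kubelet code; the topology and the numaOf/bestNode helpers are made up for illustration.
// simplified, self-contained model of the hint calculation (not kubelet code)
package main

import "fmt"

// numaOf maps each CPU ID to its NUMA node in a toy two-node topology
// (8 CPUs per node; the real repro machine has far more, but the shape
// of the problem is the same).
var numaOf = map[int]int{}

func init() {
	for cpu := 0; cpu < 8; cpu++ {
		numaOf[cpu] = 0   // node 0: CPUs 0-7
		numaOf[cpu+8] = 1 // node 1: CPUs 8-15
	}
}

// bestNode returns the first single NUMA node that can satisfy `request`
// exclusive CPUs, optionally counting `reusable` CPUs (those freed by an
// exited init container) toward the total.
func bestNode(available, reusable []int, request int, countReusable bool) (int, bool) {
	for node := 0; node < 2; node++ {
		matching := 0
		if countReusable {
			for _, c := range reusable {
				if numaOf[c] == node {
					matching++
				}
			}
		}
		for _, c := range available {
			if numaOf[c] == node {
				matching++
			}
		}
		if matching >= request {
			return node, true
		}
	}
	return -1, false
}

func main() {
	// The init container took 6 CPUs on node 0 and exited; they are reusable.
	reusable := []int{0, 1, 2, 3, 4, 5}
	// Remaining free CPUs: only 2 left on node 0, all 8 on node 1.
	available := []int{6, 7, 8, 9, 10, 11, 12, 13, 14, 15}

	// Buggy behaviour: ignoring reusable CPUs, node 0 appears to have only
	// 2 free CPUs, so the hint lands on node 1.
	node, ok := bestNode(available, reusable, 6, false)
	fmt.Println("ignoring reusable CPUs -> node", node, ok) // node 1 true

	// Fixed behaviour: counting reusable CPUs, node 0 can satisfy the request,
	// so the app container is steered back onto the init container's CPUs.
	node, ok = bestNode(available, reusable, 6, true)
	fmt.Println("counting reusable CPUs -> node", node, ok) // node 0 true
}
Ignoring the reusable CPUs, the only single node that appears able to satisfy the request is node 1, which is exactly symptom 1 above; counting them, node 0 wins and the init container's CPUs are reused.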

2. How to reproduce

  • kubelet sets the --topology-manager-policy flag to best-effort or restricted to enable the TopologyManager. Refer to topology-manager. (An equivalent kubelet configuration file is sketched at the end of this section.)
  • kubelet sets the --cpu-manager-policy flag to static. Refer to cpu-management-policies.
  • Create a Pod with an initContainer and Guaranteed QoS:
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s.tencent.com/arch: vk-new
    vk.k8s.tencent.com/cluster: test
  name: nginx-250-1
  namespace: default
spec:
  initContainers:
  - image: linux2.2_apd:latest
    imagePullPolicy: IfNotPresent
    name: test
    command: ["sh", "-c", "exit 0"]
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: "40"
        memory: "1Gi"
      requests:
        cpu: "40"
        memory: "1Gi"
  containers:
  - image: linux2.2_apd:latest
    imagePullPolicy: IfNotPresent
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    resources: 
      limits:
        cpu: "40"
        memory: "1Gi"
      requests:
        cpu: "40"
        memory: "1Gi"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  • The initContainer test runs and exits successfully.
  • The container nginx starts running.
  • Check /var/lib/kubelet/cpu_manager_state to see the resulting CPU assignments:
// /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-3,44-51,92-95","entries":{"d47c51cb-c5a2-4910-a92b-60a399dcc581":
  {
     "nginx":"24-43,72-91",
     "test":"4-23,52-71"
  }
},"checksum":2366120229}

The container nginx is not reusing the CPUs that the initContainer test previously allocated. This leads to a shortage of allocatable CPUs when other Guaranteed QoS Pods attempt to request exclusive CPUs on the node.
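
As an aside, the two kubelet policies from the first two steps can also be set in the kubelet configuration file instead of via command-line flags. A minimal sketch using kubelet.config.k8s.io/v1beta1 KubeletConfiguration fields (the reserved CPU range is only an example; the static policy requires some CPUs to be reserved for system daemons):
# e.g. /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: best-effort   # or "restricted"
reservedSystemCPUs: "0-3"            # example reservation for system daemons
Note that changing cpuManagerPolicy on an existing node requires removing the old /var/lib/kubelet/cpu_manager_state file and restarting kubelet.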

3. Under the hood

  • When an initContainer exits, kubelet marks its requested CPUs as reusable.
  • When a container tries to start, kubelet generates topology hints from the available CPUs and the number of CPUs being requested.
  • When generating the CPU topology hints, kubelet does not take the previously marked reusable CPUs into account.
  • As a result, those CPUs are not reused by the new container.
  1. The call chain for allocating CPUs to a Pod:
    • Admit → calculateAffinity → GetTopologyHints → setTopologyHints → allocateAlignedResources → Allocate → removeStaleState (remove container and update defaultCpuSet) → GetAffinity → getTopologyHints → allocateCPUs → updateCPUsToReuse
  2. An init container's reusable CPUs can only be reused by app containers of the same Pod.
  3. An exited init container is not removed until the Pod itself is deleted, so its reusable CPUs are never released back to the shared pool during the Pod's lifetime.
  4. The app container reuses its init container's cpuset as long as the topology requirements are met; a simplified sketch of this bookkeeping follows the list.
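
The following is a minimal, self-contained sketch of that bookkeeping, not the actual kubelet source; the names reusePool, afterAllocate, reusable, and removePod are made up for illustration. It models how CPUs given to an init container become reusable for later containers of the same Pod, how an app container consumes them, and why the set only disappears when the Pod is deleted.
// simplified model of the per-pod "CPUs to reuse" bookkeeping (not kubelet code)
package main

import "fmt"

// reusePool models the per-pod set of reusable CPUs, keyed by pod UID.
// In kubelet this bookkeeping lives inside the static CPU manager policy.
type reusePool map[string]map[int]bool

// afterAllocate records an allocation: CPUs given to an init container become
// reusable for later containers of the same Pod; CPUs given to an app container
// are removed from the reusable set because they are now exclusively in use.
func (r reusePool) afterAllocate(podUID string, isInitContainer bool, cpus []int) {
	if r[podUID] == nil {
		r[podUID] = map[int]bool{}
	}
	for _, c := range cpus {
		if isInitContainer {
			r[podUID][c] = true
		} else {
			delete(r[podUID], c)
		}
	}
}

// reusable returns the CPUs currently marked reusable for a Pod; a new
// container of the same Pod should be allocated from these CPUs first.
func (r reusePool) reusable(podUID string) []int {
	var cpus []int
	for c := range r[podUID] {
		cpus = append(cpus, c)
	}
	return cpus
}

// removePod drops the Pod's entry; this only happens when the Pod itself is
// deleted, which is why an exited init container keeps "holding" its CPUs.
func (r reusePool) removePod(podUID string) {
	delete(r, podUID)
}

func main() {
	pool := reusePool{}

	// Init container "test" is allocated CPUs 4-7 and exits; they become reusable.
	pool.afterAllocate("pod-1", true, []int{4, 5, 6, 7})
	fmt.Println("reusable after init container:", pool.reusable("pod-1"))

	// App container "nginx" should be offered those CPUs again; once it takes
	// them, they are no longer reusable.
	pool.afterAllocate("pod-1", false, []int{4, 5, 6, 7})
	fmt.Println("reusable after app container:", pool.reusable("pod-1"))

	// Only deleting the Pod clears the bookkeeping.
	pool.removePod("pod-1")
}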

4. Solution

  • Take the previously marked reusable CPUs into account in generateCPUTopologyHints():
// pkg/kubelet/cm/cpumanager/policy_static.go
func (p *staticPolicy) generateCPUTopologyHints(availableCPUs cpuset.CPUSet, reusableCPUs cpuset.CPUSet, request int) []topologymanager.TopologyHint {
	// Initialize minAffinitySize to include all NUMA Nodes.
	minAffinitySize := p.topology.CPUDetails.NUMANodes().Size()

	// Iterate through all combinations of numa nodes bitmask and build hints from them.
	hints := []topologymanager.TopologyHint{}
	bitmask.IterateBitMasks(p.topology.CPUDetails.NUMANodes().List(), func(mask bitmask.BitMask) {
		// First, update minAffinitySize for the current request size.
		cpusInMask := p.topology.CPUDetails.CPUsInNUMANodes(mask.GetBits()...).Size()
		if cpusInMask >= request && mask.Count() < minAffinitySize {
			minAffinitySize = mask.Count()
		}

		// Then check to see if we have enough CPUs available on the current
		// numa node bitmask to satisfy the CPU request.
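		// (This loop is the heart of the fix: every reusable CPU freed by an init
		// container must fall inside the candidate NUMA mask, and those CPUs count
		// toward satisfying the request, so the hints steer the app container back
		// onto the node(s) that already hold the init container's CPUs.)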
		numMatching := 0
		for _, c := range reusableCPUs.List() {
			// Disregard this mask if its NUMANode isn't part of it.
			if !mask.IsSet(p.topology.CPUDetails[c].NUMANodeID) {
				return
			}
			numMatching++
		}

		// Finally, check to see if enough available CPUs remain on the current
		// NUMA node combination to satisfy the CPU request.
		for _, c := range availableCPUs.List() {
			if mask.IsSet(p.topology.CPUDetails[c].NUMANodeID) {
				numMatching++
			}
		}

		// If they don't, then move onto the next combination.
		if numMatching < request {
			return
		}

		// Otherwise, create a new hint from the numa node bitmask and add it to the
		// list of hints.  We set all hint preferences to 'false' on the first
		// pass through.
		hints = append(hints, topologymanager.TopologyHint{
			NUMANodeAffinity: mask,
			Preferred:        false,
		})
	})

	// Loop back through all hints and update the 'Preferred' field based on
	// counting the number of bits sets in the affinity mask and comparing it
	// to the minAffinitySize. Only those with an equal number of bits set (and
	// with a minimal set of numa nodes) will be considered preferred.
	for i := range hints {
		if p.options.AlignBySocket && p.isHintSocketAligned(hints[i], minAffinitySize) {
			hints[i].Preferred = true
			continue
		}
		if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
			hints[i].Preferred = true
		}
	}

	return hints
}
  • Before the fix:
{"policyName":"static","defaultCpuSet":"0-3,44-51,92-95","entries":{"d47c51cb-c5a2-4910-a92b-60a399dcc581":
  {
     "nginx":"24-43,72-91",
     "test":"4-23,52-71"
  }
},"checksum":2366120229}
  • After the fix:
{"policyName":"static","defaultCpuSet":"0-3,29-51,76-95","entries":{"084f62a7-beb6-4d09-b007-6d54d9890fb8":
  {
     "nginx":"4-23,52-71",
     "test":"4-23,52-71"
  }
},"checksum":1462470388}

The container nginx now reuses the CPUs that the initContainer test previously allocated.