1. Description
This bug exists in early versions of Kubernetes, such as v1.18.
Related commit: Fix a bug whereby reusable CPUs and devices were not being honored #93289
Related PRs: Fix a bug whereby reusable CPUs and devices were not being honored #93189; Refactor the algorithm used to decide CPU assignments in the CPUManager #102014
Previously, it was possible for reusable CPUs and reusable devices (i.e. those previously consumed by init containers) to not be reused by subsequent init containers or app containers if the TopologyManager was enabled. This would happen because hint generation for the TopologyManager was not considering the reusable devices when it made its hint calculation.
As such, it would sometimes:
- Generate a hint for a different NUMA node, causing the CPUs and devices to be allocated from that node instead of the one where the reusable devices live; or
- End up thinking there were not enough CPUs or devices to allocate and throw a TopologyAffinity admission error
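To make the second failure mode concrete, here is a toy sketch in Go. The numbers and the admit() helper are purely illustrative, not kubelet code, and they differ from the reproduction below (where the second NUMA node still has enough free CPUs, so the first failure mode is hit instead); they only show how ignoring reusable CPUs can make a satisfiable request look unsatisfiable and get the pod rejected under the restricted policy.
package main

import "fmt"

func admit(hintExists bool) string {
	if hintExists {
		return "admitted"
	}
	return "rejected: TopologyAffinity error"
}

func main() {
	// 40 exclusive CPUs requested; NUMA node 0 has 8 free CPUs plus 40 CPUs
	// released by an exited init container, NUMA node 1 has 20 free CPUs.
	request, free0, reusable0, free1 := 40, 8, 40, 20

	// Pre-fix hint math: reusable CPUs are invisible, so neither node (nor
	// both together) appears able to satisfy the request and no hint is made.
	ignoringReusable := free0 >= request || free1 >= request || free0+free1 >= request

	// Post-fix hint math: the 40 reusable CPUs on node 0 are counted again.
	countingReusable := free0+reusable0 >= request

	fmt.Println("before fix:", admit(ignoringReusable)) // rejected
	fmt.Println("after fix: ", admit(countingReusable)) // admitted
}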
2. How to reproduce
- Set the kubelet `--topology-manager-policy` flag to `best-effort` or `restricted` to enable the TopologyManager. Refer to topology-manager.
- Set the kubelet `--cpu-manager-policy` flag to `static`. Refer to cpu-management-policies.
- Create a pod with an initContainer and Guaranteed QoS:
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s.tencent.com/arch: vk-new
    vk.k8s.tencent.com/cluster: test
  name: nginx-250-1
  namespace: default
spec:
  initContainers:
  - image: linux2.2_apd:latest
    imagePullPolicy: IfNotPresent
    name: test
    command: ["sh", "-c", "exit 0"]
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: "40"
        memory: "1Gi"
      requests:
        cpu: "40"
        memory: "1Gi"
  containers:
  - image: linux2.2_apd:latest
    imagePullPolicy: IfNotPresent
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: "40"
        memory: "1Gi"
      requests:
        cpu: "40"
        memory: "1Gi"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
- The initContainer `test` runs and exits successfully.
- The container `nginx` starts running.
- Check `/var/lib/kubelet/cpu_manager_state` to see the CPU assignments.
// /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-3,44-51,92-95","entries":{"d47c51cb-c5a2-4910-a92b-60a399dcc581":
{
"nginx":"24-43,72-91",
"test":"4-23,52-71"
}
},"checksum":2366120229}
The container `nginx` is not reusing the CPUs that the initContainer `test` previously allocated. This leaves only 16 CPUs in the shared pool (`defaultCpuSet` is `0-3,44-51,92-95`), so other Guaranteed QoS pods that later request exclusive CPUs may find there are not enough left to allocate.
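One way to confirm whether reuse happened is to expand the cpuset strings from the state file and look for overlap between the two containers. The helper below is an illustrative standalone sketch, not part of the kubelet; the cpuset strings are copied from the state dump above.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expand turns a cpuset string such as "0-3,44-51" into a set of CPU IDs.
func expand(s string) map[int]bool {
	cpus := map[int]bool{}
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(part, "-", 2)
		lo, _ := strconv.Atoi(bounds[0])
		hi := lo
		if len(bounds) == 2 {
			hi, _ = strconv.Atoi(bounds[1])
		}
		for c := lo; c <= hi; c++ {
			cpus[c] = true
		}
	}
	return cpus
}

func main() {
	test := expand("4-23,52-71")   // init container "test"
	nginx := expand("24-43,72-91") // app container "nginx" (before the fix)

	shared := 0
	for c := range nginx {
		if test[c] {
			shared++
		}
	}
	fmt.Printf("nginx reuses %d of the %d CPUs released by test\n", shared, len(test))
	// Before the fix this prints 0; after the fix nginx's set equals test's set.
}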
3. Under the hood
- When an initContainer exits, the kubelet marks its requested CPUs as `reusable`.
- When a container tries to start, the kubelet generates topology hints from the available CPUs and the number of CPUs being requested.
- When generating those CPU topology hints, the kubelet does not take the previously marked `reusable` CPUs into account.
- As a result, those CPUs are not reused by the new container.
- The CPU allocation flow for a pod: Admit -> calculateAffinity -> GetTopologyHints -> setTopologyHints -> allocateAlignedResources -> Allocate -> removeStaleState (remove stale containers and update the default CPU set) -> GetAffinity -> getTopologyHints -> allocateCPUs -> updateCPUsToReuse
- An init container's reusable CPUs can only be reused by app containers of the same pod.
- An exited init container is not removed until the pod is deleted, so its reusable CPUs are not released back to the shared pool during the pod's lifetime.
- An app container reuses its init container's cpuset when the topology requirements are met (a simplified sketch of this bookkeeping follows the list).
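A minimal sketch of that per-pod bookkeeping, assuming a simplified cpusToReuse map and an updateCPUsToReuse helper modeled on the flow above (the real kubelet code works with cpuset.CPUSet and the pod's API objects; only the idea is kept here):
package main

import "fmt"

type cpuSet map[int]bool // stand-in for cpuset.CPUSet

type staticPolicy struct {
	cpusToReuse map[string]cpuSet // pod UID -> CPUs that later containers of that pod may reuse
}

// updateCPUsToReuse grows the reusable set for an init container's CPUs and
// shrinks it again when an app container claims some of them. Because the map
// is keyed by pod UID, reuse never crosses pod boundaries, and the entry only
// goes away when the pod itself is removed.
func (p *staticPolicy) updateCPUsToReuse(podUID string, isInitContainer bool, assigned cpuSet) {
	if p.cpusToReuse[podUID] == nil {
		p.cpusToReuse[podUID] = cpuSet{}
	}
	for c := range assigned {
		if isInitContainer {
			p.cpusToReuse[podUID][c] = true // init container: CPUs become reusable
		} else {
			delete(p.cpusToReuse[podUID], c) // app container: CPUs are in use again
		}
	}
}

func main() {
	p := &staticPolicy{cpusToReuse: map[string]cpuSet{}}
	p.updateCPUsToReuse("pod-uid", true, cpuSet{4: true, 5: true})  // init container "test"
	fmt.Println(len(p.cpusToReuse["pod-uid"]))                      // 2 reusable CPUs
	p.updateCPUsToReuse("pod-uid", false, cpuSet{4: true, 5: true}) // app container "nginx"
	fmt.Println(len(p.cpusToReuse["pod-uid"]))                      // 0: nginx took them back
}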
4. Solution
- Take the previously marked `reusable` CPUs into account in `generateCPUTopologyHints()`:
// pkg/kubelet/cm/cpumanager/policy_static.go
func (p *staticPolicy) generateCPUTopologyHints(availableCPUs cpuset.CPUSet, reusableCPUs cpuset.CPUSet, request int) []topologymanager.TopologyHint {
    // Initialize minAffinitySize to include all NUMA Nodes.
    minAffinitySize := p.topology.CPUDetails.NUMANodes().Size()

    // Iterate through all combinations of numa nodes bitmask and build hints from them.
    hints := []topologymanager.TopologyHint{}
    bitmask.IterateBitMasks(p.topology.CPUDetails.NUMANodes().List(), func(mask bitmask.BitMask) {
        // First, update minAffinitySize for the current request size.
        cpusInMask := p.topology.CPUDetails.CPUsInNUMANodes(mask.GetBits()...).Size()
        if cpusInMask >= request && mask.Count() < minAffinitySize {
            minAffinitySize = mask.Count()
        }

        // Then check to see if all of the reusable CPUs are part of the bitmask.
        numMatching := 0
        for _, c := range reusableCPUs.List() {
            // Disregard this mask if its NUMANode isn't part of it.
            if !mask.IsSet(p.topology.CPUDetails[c].NUMANodeID) {
                return
            }
            numMatching++
        }

        // Finally, check to see if enough available CPUs remain on the current
        // NUMA node combination to satisfy the CPU request.
        for _, c := range availableCPUs.List() {
            if mask.IsSet(p.topology.CPUDetails[c].NUMANodeID) {
                numMatching++
            }
        }

        // If they don't, then move onto the next combination.
        if numMatching < request {
            return
        }

        // Otherwise, create a new hint from the numa node bitmask and add it to the
        // list of hints. We set all hint preferences to 'false' on the first
        // pass through.
        hints = append(hints, topologymanager.TopologyHint{
            NUMANodeAffinity: mask,
            Preferred:        false,
        })
    })

    // Loop back through all hints and update the 'Preferred' field based on
    // counting the number of bits set in the affinity mask and comparing it
    // to the minAffinitySize. Only those with an equal number of bits set (and
    // with a minimal set of numa nodes) will be considered preferred.
    for i := range hints {
        if p.options.AlignBySocket && p.isHintSocketAligned(hints[i], minAffinitySize) {
            hints[i].Preferred = true
            continue
        }
        if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
            hints[i].Preferred = true
        }
    }

    return hints
}
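As a worked example (toy code, not kubelet source), apply the two checks above to the reproduction data, assuming CPUs 0-23,48-71 sit on NUMA node 0 and CPUs 24-47,72-95 on node 1: the exited init container test then holds its 40 reusable CPUs on node 0, and at hint time node 0 has 8 free CPUs while node 1 has 48.
package main

import "fmt"

func main() {
	request := 40                         // CPUs requested by "nginx"
	available := map[int]int{0: 8, 1: 48} // free CPUs per NUMA node at hint time
	reusable := map[int]int{0: 40, 1: 0}  // CPUs released by init container "test"

	// Single-node masks, mirroring the smallest bitmasks the real code iterates.
	for node := 0; node <= 1; node++ {
		// Pre-fix behaviour: only the free CPUs are counted.
		before := available[node] >= request

		// Post-fix behaviour: the mask is discarded if any reusable CPU lives
		// outside it (the early return above); otherwise reusable CPUs count too.
		reusableElsewhere := 0
		for n, cnt := range reusable {
			if n != node {
				reusableElsewhere += cnt
			}
		}
		after := reusableElsewhere == 0 && available[node]+reusable[node] >= request

		fmt.Printf("NUMA node %d: hint before fix=%v, after fix=%v\n", node, before, after)
	}
	// Before: only node 1 qualifies, so nginx is placed on 24-43,72-91.
	// After: only node 0 qualifies, so nginx reuses test's CPUs 4-23,52-71.
}
The early return is what pins the hint to the NUMA nodes where the reusable CPUs live; merely adding them to the count would not have been enough to guarantee reuse.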
- Before the fix:
{"policyName":"static","defaultCpuSet":"0-3,44-51,92-95","entries":{"d47c51cb-c5a2-4910-a92b-60a399dcc581":
{
"nginx":"24-43,72-91",
"test":"4-23,52-71"
}
},"checksum":2366120229}
- After the fix:
{"policyName":"static","defaultCpuSet":"0-3,29-51,76-95","entries":{"084f62a7-beb6-4d09-b007-6d54d9890fb8":
{
"nginx":"4-23,52-71",
"test":"4-23,52-71"
}
},"checksum":1462470388}
The container `nginx` now reuses the CPUs that the initContainer `test` previously allocated.