【Scheduling】Load-aware scheduler plugin

1. Background

Related proposals: KEP61: Real Load Aware Scheduling.

Source code address: Trimaran: Real Load Aware Scheduling

The native Kubernetes scheduling logic, which is based on Pod requests and node Allocatable, does not reflect the real load of cluster nodes. This scheduler plug-in therefore factors the real node load into Pod scheduling decisions. Its core component, Load Watcher, comes from an open source project by PayPal. Load Watcher collects cluster-wide actual resource usage data (including CPU, memory, network IO, and disk IO usage) from data sources such as Prometheus, SignalFx, and the Kubernetes Metrics Server.

2. Implementation principle

The Load-aware scheduler plug-in collects the actual resource usage data of the cluster through the Load Watcher component and records the actual resource usage of each node. During the scoring stage of scheduling, it scores each node according to its actual load, so that the Pod is scheduled onto an appropriate node. The Load-aware scheduler includes two scoring extension plug-ins (Score Plugin): LoadVariationRiskBalancing and TargetLoadPacking.

a. LoadVariationRiskBalancing

The load risk balancer calculates a load risk (risk) from the average usage and the variation in usage (standard deviation) of a node's resources (CPU and memory are supported). The risk feeds into the scheduler's score plug-in and so affects the score of each node during scheduling. Load risk (risk) is calculated as follows:

risk = [ average + margin * stDev^{1/sensitivity} ] / 2

The risk is calculated per resource, and worstRisk is the largest of these values. The greater the load risk, the lower the score. The final node score is calculated as follows:

score = maxScore * (1 - worstRisk)
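The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the plug-in's actual code: the function names are hypothetical, average and stDev are assumed to be expressed as fractions of node capacity in the range 0 to 1, and maxScore is assumed to be 100.

```python
MAX_NODE_SCORE = 100  # assumption: maximum score a node can receive


def load_risk(average, st_dev, margin=1.0, sensitivity=2.0):
    """risk = [average + margin * stDev^(1/sensitivity)] / 2.

    average and st_dev are assumed to be fractions of capacity (0..1).
    """
    return (average + margin * st_dev ** (1.0 / sensitivity)) / 2.0


def node_score(resources, margin=1.0, sensitivity=2.0):
    """score = maxScore * (1 - worstRisk), where worstRisk is the
    highest risk among the given resources.

    resources maps a resource name to an (average, st_dev) pair.
    """
    worst_risk = max(
        load_risk(avg, sd, margin, sensitivity) for avg, sd in resources.values()
    )
    return round(MAX_NODE_SCORE * (1.0 - worst_risk))


if __name__ == "__main__":
    # Hypothetical node: CPU averages 50% with stDev 0.04,
    # memory averages 30% with stDev 0.01.
    usage = {"cpu": (0.5, 0.04), "memory": (0.3, 0.01)}
    print(node_score(usage))
```

With margin 1 and sensitivity 2, the CPU risk in this example is (0.5 + 0.2) / 2 = 0.35 and the memory risk is (0.3 + 0.1) / 2 = 0.2, so CPU determines worstRisk and the node scores 65.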

Scheduler plug-in configuration example:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: trimaran
  plugins:
    score:
      enabled:
      - name: LoadVariationRiskBalancing
  pluginConfig:
  - name: LoadVariationRiskBalancing
    args:
      safeVarianceMargin: 1
      safeVarianceSensitivity: 2
      metricProvider:
        type: Prometheus
        address: http://prometheus-k8s.monitoring.svc.cluster.local:9090

b. TargetLoadPacking

A scoring algorithm based on a target load. It uses the following parameters and values:

Configure the targetUtilization parameter to specify the target CPU utilization percentage (default is 40)
Configure the defaultRequests parameter to specify the default CPU request assumed for a container (used when the Pod sets neither requests nor limits, defaults to 1 core)
Configure the defaultRequestsMultiplier parameter to specify the overcommit multiplier applied to the container's CPU requests value (used when the Pod sets requests but not limits, defaults to 1.5)
nodeCPUUtilMillis is the real node CPU usage collected through the Load Watcher component
nodeCPUCapMillis is the total CPU size of the node
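missingCPUUtilMillis is the predicted CPU usage of Pods recently scheduled to the node whose usage is not yet reflected in the collected metrics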
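The way defaultRequests and defaultRequestsMultiplier are applied can be sketched as follows. This is a minimal sketch: the function name is hypothetical, values are in millicores, and the branch for a container that sets limits (use the limit as the predicted usage) is an assumption that the parameter descriptions above only imply.

```python
def predict_container_cpu_millis(
    requests_millis=None,
    limits_millis=None,
    default_requests_millis=1000,  # defaultRequests: 1 core
    requests_multiplier=1.5,       # defaultRequestsMultiplier
):
    """Predict the CPU usage (millicores) of one container."""
    if limits_millis is not None:
        # Assumption: a configured limit is taken as the predicted usage.
        return limits_millis
    if requests_millis is not None:
        # Only requests set: scale them by the multiplier.
        return round(requests_millis * requests_multiplier)
    # Neither requests nor limits set: fall back to the default.
    return default_requests_millis


if __name__ == "__main__":
    print(predict_container_cpu_millis())                     # no requests, no limits
    print(predict_container_cpu_millis(requests_millis=500))  # requests only
    print(predict_container_cpu_millis(500, 800))             # requests and limits
```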

The final node score is calculated as follows:

// When Pod does not set requests and limits
predictedCPUUsage = (defaultRequests + nodeCPUUtilMillis) * 100 / nodeCPUCapMillis

// When the Pod only sets requests but does not set limits
predictedCPUUsage = ( ( defaultRequestsMultiplier * pod requests) + nodeCPUUtilMillis ) * 100 / nodeCPUCapMillis

// score calculation (applies while predictedCPUUsage <= targetUtilization)
score = (100 - targetUtilization) * predictedCPUUsage / targetUtilization + targetUtilization
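The score calculation can be sketched in Python as below. The function names are hypothetical. The first branch follows the formula above; the other two branches (a linearly decreasing score once the predicted usage exceeds the target, and a score of 0 above 100%) are not stated in this article and are an assumption based on the upstream TargetLoadPacking design, so treat them as illustrative.

```python
def predicted_cpu_usage(pod_cpu_millis, node_cpu_util_millis, node_cpu_cap_millis):
    """Predicted node CPU usage in percent after placing the Pod."""
    return (pod_cpu_millis + node_cpu_util_millis) * 100 / node_cpu_cap_millis


def target_load_packing_score(predicted_usage, target_utilization=40):
    """Score a node from its predicted CPU usage (percent)."""
    if predicted_usage > 100:
        # Assumption: an overloaded node receives the minimum score.
        return 0
    if predicted_usage > target_utilization:
        # Assumption: above the target, the score falls linearly to 0 at 100%.
        return round(
            target_utilization * (100 - predicted_usage) / (100 - target_utilization)
        )
    # At or below the target: the formula given in the text.
    return round(
        (100 - target_utilization) * predicted_usage / target_utilization
        + target_utilization
    )


if __name__ == "__main__":
    # Hypothetical node: 4 cores, 1 core in use, Pod predicted at 1 core.
    usage = predicted_cpu_usage(1000, 1000, 4000)
    print(usage, target_load_packing_score(usage, target_utilization=70))
```

Under these assumptions the score peaks at 100 when the predicted usage equals targetUtilization, which is what packs Pods onto nodes until they reach the target and then steers new Pods elsewhere.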

Scheduler plugin configuration example:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: trimaran
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
      - name: NodeResourcesLeastAllocated
      enabled:
      - name: TargetLoadPacking
  pluginConfig:
  - name: TargetLoadPacking
    args:
      defaultRequests:
        cpu: "2000m"
      defaultRequestsMultiplier: "2"
      targetUtilization: 70
      metricProvider: 
        type: Prometheus
        address: http://prometheus-k8s.monitoring.svc.cluster.local:9090

3. User Manual

The Load Watcher component used by the scheduler plug-in can be deployed in two ways:

a. Deploying a separate third-party Load Watcher

With this method, you deploy the Load Watcher component separately in the cluster, expose it through a Service, and set its address in the watcherAddress configuration item of the scheduler plug-in, for example:

watcherAddress: http://xxxx.svc.cluster.local:2020/

[Figure: deployment architecture]

b. Using the built-in Load Watcher

With this method, the Load Watcher logic is embedded in the scheduler plug-in, and you only need to configure the data source information in the scheduler plug-in configuration, for example:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: trimaran
  plugins:
    score:
      enabled:
      - name: LoadVariationRiskBalancing
  pluginConfig:
  - name: LoadVariationRiskBalancing
    args:
      metricProvider:
        type: Prometheus
        address: http://prometheus-k8s.monitoring.svc.cluster.local:9090
      safeVarianceMargin: 1
      safeVarianceSensitivity: 2

The configuration above sets the data source type to Prometheus and the data source address to http://prometheus-k8s.monitoring.svc.cluster.local:9090.