Kubernetes* workloads are divided into three quality of service (QoS) classes:
- Guaranteed workloads get the CPU and memory they request.
- Burstable workloads get at least the minimum they request.
- BestEffort workloads share what is left.
A deeper look into Kubernetes nodes, however, reveals that some performance-critical resources are always used in a best-effort manner by all workloads. High-priority workloads cannot take priority over others even when that is badly needed, and there is no way to throttle a workload so that it runs with even less effort than BestEffort.
We concentrate on three resources that have been "best effort for all" until the recent versions of two container runtimes, CRI-O and containerd: storage I/O bandwidth, memory bandwidth, and CPU cache. We start by showing how to prioritize and deprioritize use of these resources. Then we generalize the approach to include these and similar resources under the extended umbrella of Kubernetes QoS.
Example of Controlling Interference
Consider the example of three workloads that are scheduled on a node in a Kubernetes cluster.
- 911: A performance-critical, low-latency workload that must run without disruptions whenever needed.
- Flower delivery: A user-facing service that must meet its service level agreement.
- File system scanner: One of the node's background integrity checks. It is heavy on I/O when it runs, but not time-critical.
There are two ways to tune these workloads. First, ensure that the 911 workload has more bandwidth and higher priority than all other workloads. Second, throttle the file system scanner to control how much it can disturb other services on the node.
Exclusive Resources and High Priority
The 911 workload runs on exclusive CPU cores. We configure them using the static CPU manager policy in kubelet (the Kubernetes node agent). If more detailed control of the CPU cores is needed (for example, making sure that they are close to the same memory controller or to other devices such as GPUs), the Container Runtime Interface Resource Manager (CRI-RM), a proxy between kubelet and the container runtime, can be used.
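For reference, enabling the static CPU manager policy is a kubelet configuration change. The fragment below is only a minimal sketch; the file name and the reserved CPU set are illustrative, and exclusive cores are granted only to Guaranteed pods that request whole CPUs.

kubelet-config.yaml (sketch):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# The static policy needs some CPUs reserved for system and kubelet use;
# the exact set below is only an example.
reservedSystemCPUs: "0,1"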
Figure 1 illustrates an Intel® Resource Director Technology (Intel® RDT) configuration with a class that has an exclusive portion of the L3 cache and unrestricted memory bandwidth. We assign the 911 workload to this RDT class using the RDT pod annotation.
Let's now make the storage I/O priority of the 911 workload exceed that of all other workloads. Use an I/O scheduler that supports priorities (such as Budget Fair Queueing [BFQ]) and assign the workload to a blockio class with a higher blockio weight than any other class, including the default. See Figure 1:
911-pod.yaml:
annotations:
  rdt.resources.beta.kubernetes.io/pod: highprio
  blockio.resources.beta.kubernetes.io/pod: highprio

rdt-classes.yaml:
partitions:
  exclusive:
    l3Allocation: "80%"
    mbAllocation: ["100%"]
    classes:
      highprio:
        l3Allocation: "100%" # this class gets all the cache in this partition

blockio-classes.yaml:
classes:
  highprio:
    - weight: 400
Figure 1: High priority RDT and blockio configuration of the 911 pod.
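Putting the pieces together, a full 911 pod might look like the sketch below. The pod name, image, and CPU and memory amounts are illustrative; the essential parts are the annotations from Figure 1 and the whole-CPU requests that equal the limits, which make the pod Guaranteed and eligible for exclusive cores under the static CPU manager policy.

911-pod.yaml (sketch):
apiVersion: v1
kind: Pod
metadata:
  name: app-911                     # illustrative name
  annotations:
    rdt.resources.beta.kubernetes.io/pod: highprio
    blockio.resources.beta.kubernetes.io/pod: highprio
spec:
  containers:
  - name: app
    image: registry.example.com/app-911:latest   # placeholder image
    resources:
      requests:
        cpu: "2"                    # whole CPUs, requests == limits
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 2Gi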
Throttling and Low Priority
As shown in Figure 2, we define the CPU utilization limit for the file system scanner workload by setting the Kubernetes CPU resource limit. In this throttling configuration, the limit is set to 25% of one CPU core's time (cpu: 250m).
fs-scanner-pod.yaml:
annotations:
  rdt.resources.beta.kubernetes.io/pod: throttledmembw
  blockio.resources.beta.kubernetes.io/pod: lowpriothrottle
...
containers:
  resources:
    limits:
      cpu: 250m

rdt-classes.yaml:
partitions:
  ...
  shared:
    l3Allocation: ["20%"]
    mbAllocation: ["20%"]
    classes:
      throttledmembw:
        mbAllocation: ["100%"]

blockio-classes.yaml:
classes:
  lowpriothrottle:
    - devices:
        - ...
      throttlereadbps: 60M
      throttlewritebps: 40M
      weight: 40

Figure 2: Throttling and low-priority RDT and blockio configuration of the file system scanner pod.
As shown in Figure 2, we limit the memory bandwidth of the file system scanner by assigning it to an RDT class with restricted memory bandwidth allocation. By default, the scanner uses all the bandwidth it can get (sharing it equally with the 911 and flower delivery workloads). In the example configuration, its memory bandwidth is limited to 20%. We also restrict its L3 cache usage to 20%, because 80% of the cache was exclusively allocated to the highprio RDT class introduced in Figure 1.
The file system scanner's storage I/O is throttled by its blockio class. The throttling parameters allow you to limit bandwidth and the number of read and write operations. Our example configuration throttles read bandwidth to 60 MB/s and write bandwidth to 40 MB/s.
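If the number of operations also needs capping, the same blockio class can carry operations-per-second limits. The sketch below assumes the throttlereadiops and throttlewriteiops keys and uses an illustrative device path; check your runtime's blockio configuration reference for the exact key names.

blockio-classes.yaml (sketch):
classes:
  lowpriothrottle:
    - devices:
        - /dev/sda              # illustrative device path
      throttlereadbps: 60M
      throttlewritebps: 40M
      throttlereadiops: 500     # assumed key: read operations per second
      throttlewriteiops: 300    # assumed key: write operations per second
      weight: 40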
Hardware and Software Dependencies
To use RDT and blockio classes as previously discussed, the following hardware and software requirements must be met:
- Linux*
- RDT classes
  - Hardware requirements: Intel RDT
  - Software requirements:
    - CRI runtime: CRI-O version 1.22 or containerd version 1.6
    - OCI runtime: runc version 1.1.0
    - resctrl pseudo-file system mounted on the host
- Blockio classes
  - Software requirements:
    - CRI runtime: CRI-O version 1.22 or containerd version 1.7
    - OCI runtime: runc version 1.0.0 or later
    - I/O scheduler weight adjustments require an I/O scheduler that supports weights, such as BFQ.
Generalizing for the Future
Let's discuss why we shared bandwidths and CPU caches based on new class memberships instead of absolute quantities or the existing Kubernetes QoS classes. We then explain how to make this approach official in Kubernetes, and introduce the next step on that path: the class-based resources Kubernetes Enhancement Proposal (KEP).
Resource-Specific QoS
We had three options for sharing these resources among workloads:
- Allocate the resources in absolute quantities
- Use existing properties of workloads, such as existing pod QoS classes
- Use new properties of workloads, such as resource-specific QoS classes
Let's look at each of these options.
Allocating resources such as bandwidth in absolute quantities has severe problems. First, the absolute available quantities may be unknown. For example, SSD read and write bandwidth depends on how the device is accessed (randomly or sequentially), while memory bandwidth depends on the nature of the operations executed and the locality of the memory relative to the CPU socket. Second, exclusive allocation hurts other workloads, possibly without an obvious benefit to the high-priority workload. Finally, the meaning of an absolute number varies with node hardware. For example, throttling the storage I/O of the file system scanner down to 200 MB/s is unnecessarily slow on a node with an Intel® Optane™ SSD, yet the same limit consumes practically all the bandwidth of another SSD and might be double the bandwidth theoretically available on a hard disk. Our conclusion is that allocating such hard-to-quantify resources is unreliable, complicated, and capable of reducing performance without clear benefits.
Existing pod QoS classes in Kubernetes are already highly overloaded. They are implicitly derived from the CPU and memory resources of pods, and they affect out-of-memory (OOM) killer behavior, pod eviction, and CPU pinning. This already has unnecessary, even unwanted, implications. Although reusing these classes might work on certain clusters, the unnecessary and unwanted implications would only grow if they also directly affected storage I/O, memory bandwidth, or cache size settings.
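As a reminder of how implicit that derivation is, the pod sketched below (names and amounts are illustrative) is classified as Guaranteed only because every container's requests equal its limits; removing the limits silently turns the same pod into a Burstable one.

qos-example-pod.yaml (sketch):
apiVersion: v1
kind: Pod
metadata:
  name: qos-example                 # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m                   # requests == limits => Guaranteed
        memory: 512Mi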
New properties of workloads such as resource-specific QoS classes hide complex details about hardware performance differences from workloads, while still allowing requests for more resources by high-priority workloads and limiting the disturbance caused by low-priority workloads. Unlike allocations, new QoS classes alone do not allow you to limit the number of high-priority workloads per node, but there are means to achieve this. As a simple example, you can expose an extended resource named after the priority class, together with the number of workloads that can belong to that class.
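A sketch of that idea, using a hypothetical extended resource name: the node advertises a capacity for the resource, and every pod in the high-priority class requests one unit of it, so the scheduler places at most that many such pods on the node.

# Node status fragment (advertised, for example, via a status patch):
status:
  capacity:
    example.com/rdt-highprio: "2"   # hypothetical extended resource name

# Pod container fragment: each high-priority pod consumes one unit.
resources:
  requests:
    example.com/rdt-highprio: "1"
  limits:
    example.com/rdt-highprio: "1"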
The Class-Based Resources KEP
We have introduced the concept of resource QoS classes and implemented two types of them (RDT and blockio) in the CRI-O and containerd runtimes. Currently, Kubernetes is unaware of these new QoS classes, and pod annotations are used to pass the resource QoS class information from the Kubernetes user down to the container runtime. Because QoS resources are not defined in the Kubernetes APIs, users cannot discover which QoS resources are available, and there is no access control or validation for them.
The Class-Based Resources KEP integrates these new resource QoS classes into Kubernetes, using RDT and blockio as existing, practical use cases. In addition, the proposal discusses clarifying the use of the existing Pod QoS class. Today this property is not exposed outside Kubernetes, and container runtimes try to infer it through side channels in order to optimize workload behavior. Passing the Pod QoS class explicitly from the kubelet to the runtimes simplifies runtime codebases and unifies their behavior. It also opens possibilities for future improvements, such as addressing existing unwanted implications like OOM killer behavior.
We invite you to share your thoughts about the proposal and the future of QoS in Kubernetes in the KEP discussion.