Dances with Cores: Meet the CPU Control Plane Plugin for Kubernetes*



Not all CPU optimizations are created equal. What if you could take known and new CPU-allocation optimizations for HPC and make them available in the Kubernetes* environment?

For instance, if you choose guaranteed quality of service (QoS) with a single non-uniform memory access (NUMA) node policy for your pod, stock Kubernetes statically pins all of the underlying cores. The current algorithms within the Kubelet* are rigid, hard to configure, and limited. There’s no way to mix guaranteed QoS and best-effort/burstable QoS within a given NUMA zone, which often leads to performance loss.
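For context, a pod lands in the guaranteed QoS class when every container’s requests equal its limits; with the Kubelet’s static CPU manager policy and an integer CPU count, those cores are then pinned exclusively to the pod. A minimal sketch of such a pod spec (the image name is hypothetical):

```yaml
# A pod in the guaranteed QoS class: requests equal limits for every
# container. With the Kubelet CPU manager policy set to "static" and an
# integer CPU count, these two cores are exclusively pinned to the pod.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload
spec:
  containers:
  - name: app
    image: example.com/app:latest   # hypothetical image
    resources:
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
```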

Enter the CPU Control Plane Plugin for Kubernetes. It’s a pluggable control plane mechanism that allows quick implementation of new core-affinity algorithms for specific jobs. Released in late 2022, the solution is available on GitHub*.

Marlow Weston and Dr. Atanas Atanasov gave an introduction to the open source CPU Control Plane at Kubernetes Batch + HPC Day North America 2022, covering how it handles artificial intelligence, machine learning, HPC and microservice workloads. It can statically pin some cores while letting others float, link affinity to namespaces for performance tuning, and add new low-latency, auto-tunable power capabilities. (Isolated cores will be an option later.)


Sounds great, but the team wanted more specifics. “We had this fancy new tool and we needed to prove that it was useful,” Weston says. They went looking for toolkits and asked whether it made sense to run the benchmarks manually. The answer was no, because they wanted reproduction to be simple. “Engineers get things wrong, including us,” Weston says. “We wanted to automate.”

Batch tools play a big role in boosting the return on investment for benchmarking: the team wanted to enable users to study application regressions and save time. They went on a hunt for tools, first excluding bash scripts (“too hard”), then seeking options that could schedule benchmark execution, similar to the Simple Linux Utility for Resource Management (Slurm). Cloud native queuing frameworks can be a valid alternative; underneath, the team used Ansible* to provision the cluster for benchmarking, handle workload deployment, validate that the workload executed properly and automatically detect errors in the logs.

Test cases

Once the tools were decided, it was time to figure out what benchmarks to use. Synthetic benchmarks would be one way to evaluate system performance, but they don’t represent the actual workloads running in the cloud today. Cloud users have complex applications consisting of multiple microservices connected over the network. There may also be service-level requirements, such as P95 latency for access to services.
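As a side note on that metric (this snippet is illustrative, not from the talk): P95 latency is the value that 95% of request latencies stay under, which can be computed from a latency sample with the nearest-rank method:

```python
# Illustrative sketch: compute the P95 latency of a set of request
# latencies, i.e. the value that 95% of requests stay at or under.
def p95(latencies_ms):
    ordered = sorted(latencies_ms)
    # Nearest-rank index of the 95th percentile (1-based rank).
    rank = max(int(0.95 * len(ordered) + 0.5), 1)
    return ordered[rank - 1]

# 100 samples of 1 ms .. 100 ms: the 95th-percentile sample is 95 ms.
samples = list(range(1, 101))
print(p95(samples))  # -> 95
```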

Google* Microservices

This application is a classic three-tier app: a front-end to receive requests from clients, business logic services and a database store that records transactions for the customers. The team’s benchmarking evaluated the throughput of such a system on four machines, and only four machines: a load generator on its own machine, plus three worker machines running the front-end, the business logic and the transaction store. The goal was to optimize the throughput of this type of system under the given latency constraints.

DeathStarBench hotel reservation system

This is also a microservice-based software platform, providing search and recommendation capabilities for hotels. There’s a clear split into three tiers, but the difference in this case is that there were separate databases for different parts of the application’s data model, plus a caching layer.

The team found that these two applications had very different scaling behavior and reacted very differently to pod placement strategies on the cluster. The hotel reservation system had a clear bottleneck in its database components: running just one instance of the database, it bottlenecked, while the best-effort QoS business logic was able to handle the increasing number of front-end requests. To fix the bottleneck, they executed two instances of the workload with two database layers, but that wasn’t the final optimization. Thanks to dual NICs, they further isolated each workload instance on its own socket, with careful network configuration. “It was very hard coded — so this isn't available easily today,” Weston says.
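The per-socket isolation above was done with careful manual configuration; at the operating-system level, the underlying mechanism is CPU affinity. A minimal, Linux-only sketch (core 0 stands in for “a core on socket 0”; real socket-to-core mappings come from the machine’s topology):

```python
import os

# Linux-only sketch of OS-level CPU affinity, the mechanism that
# underlies static pinning. Core 0 is an assumption standing in for
# "a core on socket 0"; real IDs depend on the machine's topology.
os.sched_setaffinity(0, {0})      # restrict this process to core 0
print(os.sched_getaffinity(0))    # the process's allowed core set
```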

They found that best-effort QoS did not provide any benefits on Google Microservices. The workloads suffered from the noisy-neighbor problem, which the team fixed by pinning the sensitive services and, again, isolating them on a separate socket. The remaining group of services, which were not sensitive to cache-related issues, ran on all of the unused cores on the socket.

Utilization rates

How important is extending these primitives? The team saw utilization rise from 40% to 78%, with similar results in both use cases. “We really want to be closer to 90%,” Weston says, “So there's probably more we could do about the database as the bottleneck.” She adds that the team isn’t sure what other optimizations could boost core utilization further, and that these two sample workloads are still very far from the actual applications users deploy on their systems.

Users are starting to use cloud platforms to run massive workloads for genomics, AI, HPC apps and more. These applications are performance critical and usually optimized for a certain placement model. HPC and AI applications apply pinning and affinity configuration mechanisms to extract the maximum available compute from the hardware. Currently, a lot of these applications still use Slurm, including internally at Intel, for fine-grained performance, Weston adds.

The team also wants to measure other things. For example, it’s insufficient to analyze workloads on only one or two nodes: the four nodes used for these test cases are still very far from the reality of customers running on hundreds or thousands of machines. They’re also starting to look at throughput under latency constraints, but would like to understand how these metrics behave at scale, such as whether latency goes down when pods are added, or whether throughput increases.

Check out the entire presentation on YouTube or the slides.

About the Presenters

Marlow Weston is a cloud software architect at Intel working on resource management with a focus on performance and sustainability.

Dr. Atanas Atanasov is a cloud-native senior software engineer who has been at Intel for six years.