Overview
Virtual machine (VM) performance is often limited by frequent VM exits, which occur when a guest operating system executes halt (HLT) instructions during idle periods. This article introduces the guest halt polling technique, which enhances VM performance by reducing the overhead of VM exits caused by HLT instructions. By implementing a polling interval within the guest VM, this method delays execution of HLT and avoids additional scheduling by the host scheduler. The proposed method uses a 50K_10K polling configuration and yields substantial improvements in gaming performance, with frames per second (FPS) increases ranging from 4% to 71% across various Steam* games.
The guest halt polling technique minimizes VM exits by implementing a polling mechanism within the guest itself before control is handed over to the virtual machine manager (VMM). The guest idle halt polling parameter enables the vCPU to wait for a period (that is, to poll) when it is idle, delaying the execution of HLT instructions. This saves the overhead and CPU cycles consumed by exiting and re-entering the VM state, which in turn leads to performance gains.
In this work, we applied KVM halt polling and guest halt polling optimizations, such as configuring optimal polling values and intervals, to reduce VM exits due to HLT. This improves performance by eliminating CPU cycles lost to VM-exit overhead. The results show that our implementation achieved the following:
- Improvement in the average FPS of 10 Steam games, ranging from 4% to 71%.
- Reduction in HLT VM exits by approximately 6% with the 50K_10K configuration.
- Improvement in crosvm vCPU allocation time resulting in better performance.
Motivation
Background – VM Exits
Processor support for virtualization is provided by a form of processor operation called virtual machine extensions (VMX) operation. There are two kinds of VMX transitions: transitions into VMX nonroot operation are called VM entries, and transitions from VMX nonroot operation to VMX root operation are called VM exits. Processor behavior in VMX root operation is much like normal CPU operation, while processor behavior in VMX nonroot operation is restricted and modified to facilitate virtualization. Instead of their ordinary operation, certain instructions (including the VMCALL instruction) and events cause VM exits to the VMM. Because these VM exits replace ordinary behavior, the functionality of software in VMX nonroot operation is limited.
VM exits in response to certain instructions and events (such as a page fault) are a key source of performance degradation in a virtualized system. The following figure shows how this works.
Figure 1. VM exit and VM entry flows
A VM exit marks the point at which a transition is made between the VM currently running and the VMM (hypervisor) that must exercise system control for a particular reason. In general, the processor must save a snapshot of the VM's state as it was running at the time of the exit. Kernel-based virtual machine (KVM) is a virtualization module in the Linux* kernel that allows the kernel to function as a hypervisor. For Intel® architectures, here are the steps to save a snapshot (refer to Figure 2 for an illustration of the steps).
- Record information about the cause of the VM exit in the VM-exit information fields (exit reason, exit qualification, and guest address) and update VM-entry control fields.
- Save the processor state in the guest-state area. This includes control registers, debug registers, model-specific registers (MSRs), segment registers, descriptor-table registers, RIP, RSP, and RFLAGS, as well as nonregister state, such as pending debug exceptions.
- Save MSRs in the VM-exit MSR-store area. They are used to control and report on processor performance.
- Load the processor state based on the host-state area and some VM-exit controls. This includes host control registers, debug registers, MSRs, segment registers, descriptor-table registers, RIP, RSP, RFLAGS, and page-directory-pointer-table entries, as well as nonregister state.
- Load MSRs from the VM-exit MSR-load area.
After the VMM has performed its system management function, a corresponding VM entry transitions processor control from the VMM back to the VM, repeating the previous steps in reverse order. Now you can see why VM exits generate considerable overhead: a single transition can cost hundreds or even thousands of cycles.
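To make the sequence concrete, the following simplified Python model walks through the same save/load steps at each transition. It is purely illustrative: the real transitions are performed by the processor, and the structures used here (Vmcs, guest_state, host_state, and so on) are hypothetical simplifications of the VMCS areas the manual describes.

```python
# Simplified, illustrative model of the VM exit / VM entry round trip.
# All structures are hypothetical stand-ins for VMCS areas; the real
# transitions are performed by the processor, not by software.

from dataclasses import dataclass, field

@dataclass
class Vmcs:
    exit_reason: int = 0                                  # VM-exit information fields
    exit_qualification: int = 0
    guest_state: dict = field(default_factory=dict)       # guest-state area
    host_state: dict = field(default_factory=dict)        # host-state area
    exit_msr_store: dict = field(default_factory=dict)    # VM-exit MSR-store area
    exit_msr_load: dict = field(default_factory=dict)     # VM-exit MSR-load area

def vm_exit(vmcs: Vmcs, cpu: dict, reason: int, qualification: int) -> None:
    """Model the steps the processor performs on a VM exit."""
    # 1. Record why the exit happened.
    vmcs.exit_reason = reason
    vmcs.exit_qualification = qualification
    # 2. Save the running guest's processor state (RIP, RSP, RFLAGS, ...).
    vmcs.guest_state = dict(cpu)
    # 3. Save selected MSRs into the VM-exit MSR-store area.
    vmcs.exit_msr_store = {k: v for k, v in cpu.items() if k.startswith("msr_")}
    # 4. Load the hypervisor's state from the host-state area.
    cpu.clear()
    cpu.update(vmcs.host_state)
    # 5. Load MSRs from the VM-exit MSR-load area.
    cpu.update(vmcs.exit_msr_load)

def vm_entry(vmcs: Vmcs, cpu: dict) -> None:
    """Model the reverse transition: restore the guest snapshot and resume it."""
    cpu.clear()
    cpu.update(vmcs.guest_state)

if __name__ == "__main__":
    cpu = {"rip": 0x1000, "rsp": 0x8000, "msr_ia32_spec_ctrl": 0}
    vmcs = Vmcs(host_state={"rip": 0xFFFF0000, "rsp": 0xFFFFE000})
    vm_exit(vmcs, cpu, reason=12, qualification=0)  # 12 = HLT basic exit reason
    assert cpu["rip"] == 0xFFFF0000                 # now running VMM code
    vm_entry(vmcs, cpu)
    assert cpu["rip"] == 0x1000                     # guest resumes where it left off
```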
To mitigate this problem, considerable effort has gone into:
- Reducing the number of cycles required by a single transition
- Identifying options to reduce VM exits
Figure 2. VM exit to VM entry transitions
For more information, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3C: System Programming Guide, Part 3.
VM Exit HLT
One or more virtual central processing units (vCPUs) are assigned to every virtual machine. Threads inside the VM are scheduled onto the vCPUs, and the vCPU threads are in turn scheduled on physical CPUs. A physical CPU goes into a sleep state when it is not active. When a vCPU has no threads to process and is idle, the guest operating system sends a request to halt the vCPU. Halting the vCPU causes a switch from the VM to the VMM, after which the VMM schedules the vCPU out and runs another vCPU thread on the physical CPU where the halted vCPU thread was running.

Processors support advanced programmable interrupt controller virtualization (APICv), which enables the system to inject interrupts into the VM without switching between VM and VMM for MSR writes to the interrupt command register (ICR). However, if a wakeup interrupt is generated for a halted vCPU, the VMM needs to stop the other vCPU thread running on the physical CPU, which introduces a switch between VM and VMM to run the halted vCPU and process the interrupt. Switching between the VM and the VMM, the additional context switches, and wakeup interprocessor interrupts (IPIs) that cross sockets are all expensive. As a result, there is a significant performance difference between an idle physical CPU and an idle vCPU: frequent sleeps and wakeups cause high scheduling overhead and frequent switching between VM and VMM, which negatively impacts performance.
Approach
Transitioning to VMX root operation for every HLT as it occurs would be very expensive, hence polling mechanisms are available. There are two ways to minimize the overhead of HLT VM exits: the KVM halt polling technique and the guest halt polling technique.
KVM Halt Polling Technique
With the KVM halt polling technique, when an HLT VM exit is triggered in the guest, the VMM polls for a preset time, waiting for an interrupt for the vCPU to arrive, instead of immediately scheduling the vCPU out (which would force a context switch on the physical CPU). Our work here is to determine the optimal values for the KVM polling parameters that provide significant performance gains and power savings for VM workloads. Refer to Figure 3.
Figure 3. Polling technique and HLT VM exit flow
The KVM module has four tunable module parameters (halt_poll_ns, halt_poll_ns_grow, halt_poll_ns_grow_start, and halt_poll_ns_shrink) to adjust the global maximum polling interval as well as the rates at which the polling interval grows and shrinks. These module parameters can be set through the sysfs files in /sys/module/kvm/parameters/. Based on the behavior of VM workloads, we determined optimal values for these tunable parameters; the results obtained are shown in the Results section. A sketch of how to read and set these parameters follows the reference below.
For more information, see The KVM Halt Polling System.
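As a concrete illustration, the following sketch reads and adjusts these parameters through their sysfs files. It assumes a Linux host with the kvm module loaded and root privileges; the specific values written (50,000 ns and 10,000 ns, echoing the 50K_10K configuration discussed later) are illustrative examples, not universal recommendations.

```python
# Read and tune the KVM halt polling module parameters via sysfs.
# Requires root. The values written below are illustrative only.

from pathlib import Path

KVM_PARAMS = Path("/sys/module/kvm/parameters")

def read_param(name: str) -> str:
    """Return the current value of a KVM module parameter."""
    return (KVM_PARAMS / name).read_text().strip()

def write_param(name: str, value: int) -> None:
    """Set a KVM module parameter (takes effect immediately)."""
    (KVM_PARAMS / name).write_text(str(value))

if __name__ == "__main__":
    for p in ("halt_poll_ns", "halt_poll_ns_grow",
              "halt_poll_ns_grow_start", "halt_poll_ns_shrink"):
        print(p, "=", read_param(p))

    # Example: cap the global maximum polling interval at 50 us and
    # start growing from 10 us (hypothetical tuning, not a recommendation).
    write_param("halt_poll_ns", 50_000)
    write_param("halt_poll_ns_grow_start", 10_000)
```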
Guest Halt Polling Technique
As shown in Figure 3, the guest halt polling technique implements polling in the guest itself, even before control is handed over to the VMM. The guest idle halt polling parameter enables the vCPU to wait for a period (that is, to poll) when it is idle, delaying the execution of HLT instructions. If the workload on the vCPU is woken up during this short wait, no subsequent scheduling is required, so the technique gains performance at a small marginal cost. At the same time, the technique uses an adaptive algorithm, which ensures that the additional cost produces effective benefits (performance improvement): guest idle halt polling not only solves the performance problems of special scenarios but also ensures that no performance regression occurs in general scenarios. guest_halt_poll_ns is a hard upper limit on the guest idle halt polling interval; we recommend configuring the interval according to how frequently the workload sleeps and wakes up. Five parameters (guest_halt_poll_ns, guest_halt_poll_allow_shrink, guest_halt_poll_grow, guest_halt_poll_shrink, and guest_halt_poll_grow_start) adjust the adaptive algorithm; a simplified model of this algorithm is sketched after the reference below. We tuned these five parameters for Android* and Crostini (Linux VM) workloads to obtain significant performance and power savings by reducing VM exits due to HLT events.
For more information, see Guest Halt Polling.
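In the guest, these parameters are exposed by the haltpoll cpuidle governor (typically under /sys/module/haltpoll/parameters/ on kernels built with it). The sketch below is a simplified Python model of the adaptive grow/shrink behavior they control, based on the documented semantics; the helper function and the default values are ours, not the kernel implementation.

```python
# Simplified model of the adaptive guest halt polling algorithm.
# Parameter names mirror /sys/module/haltpoll/parameters/; the control
# flow is a sketch of the documented grow/shrink behavior, not a copy
# of the kernel code. Default values are illustrative.

guest_halt_poll_ns = 200_000         # hard upper limit on polling time (ns)
guest_halt_poll_grow = 2             # multiplier when polling proved useful
guest_halt_poll_grow_start = 50_000  # first nonzero polling interval (ns)
guest_halt_poll_shrink = 2           # divisor when polling was wasted
guest_halt_poll_allow_shrink = True

def adjust_poll_ns(poll_ns: int, woke_during_poll: bool) -> int:
    """Return the next per-vCPU polling interval.

    If the wakeup arrived while polling, polling was useful: grow the
    interval (capped at guest_halt_poll_ns). Otherwise the time spent
    polling was wasted: shrink the interval (to zero if shrink is 0).
    """
    if woke_during_poll:
        grown = poll_ns * guest_halt_poll_grow if poll_ns else guest_halt_poll_grow_start
        return min(grown, guest_halt_poll_ns)
    if guest_halt_poll_allow_shrink:
        return poll_ns // guest_halt_poll_shrink if guest_halt_poll_shrink else 0
    return poll_ns

if __name__ == "__main__":
    # Example trajectory: two useful polls grow the interval, one wasted
    # poll shrinks it back.
    ns = 0
    for useful in (True, True, False):
        ns = adjust_poll_ns(ns, useful)
        print("poll_ns =", ns)
```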
Results
Our evaluation platform is an ASUS Chromebook*. It uses a 13th-generation Intel® Core™ i7-1365U processor at 2.70 GHz with 10 cores (12 threads). The processor base frequency is 1.8 GHz, and it can reach up to 5.2 GHz in Turbo mode. The device has 32 GB of memory. ChromeOS* version R123 with Android 13 is loaded on the device. We ran an internet speed test before collecting data to confirm that network bandwidth was consistent across test runs. For all performance and power assessments, the median of three iterations is used, with variance removed.
In our analysis, we studied the behavior with default polling values and with optimal polling values for gaming workloads run in the Borealis VM, the gaming VM stack in ChromeOS. For an example using the Steam game Spaceship, see Figures 4 and 5.
Figure 4. Borealis VM (gaming VM) default guest halt polling time vs. vCPU count
Figure 5. Optimized minimum polling time vs. vCPU count
Figures 4 and 5 show the crosvm thread CPU allocation time. In Figure 5, the crosvm vCPU thread receives less CPU time with the 50K_10K configuration, resulting in a performance improvement. The CPU time allocated to the vCPU, highlighted in green in Figure 5, demonstrates significantly faster processing compared to the baseline data.
Further, we experimented with different polling configurations to minimize VM exits and halt polling overhead, as shown in Table 1.
Table 1. VM Exit Report with Different Guest Polling Configurations
The table shows the VM exits with different polling start and minimum polling time values. HLT VM exits are reduced by approximately 6% with guest halt polling enabled in the 50K_10K configuration (a 50,000 ns polling start value with a 10,000 ns minimum polling time) as compared to the other combinations. In addition, the dmesg logs confirm that with the 50K_10K configuration, the number of successful polling events increased by 58%. Further, we ran 14 Steam games and observed substantial improvement in the average FPS of 10 of them, ranging from 4% to 71%. Refer to Table 2 for the results.
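To reproduce this kind of measurement on a Linux host, one option is the per-VM counters that KVM has historically exposed in debugfs. The following sketch computes a halt polling success rate from them; it assumes root privileges and a kernel that provides these files under /sys/kernel/debug/kvm/ (the exact layout and availability vary by kernel version).

```python
# Estimate the halt polling success rate for each running VM from the
# per-VM counters KVM exposes in debugfs. Layout varies by kernel
# version; requires root.

from pathlib import Path

def read_counter(vm_dir: Path, name: str) -> int:
    """Read one debugfs counter, returning 0 if the file is absent."""
    f = vm_dir / name
    return int(f.read_text()) if f.exists() else 0

for vm_dir in Path("/sys/kernel/debug/kvm").iterdir():
    if not vm_dir.is_dir():
        continue
    attempted = read_counter(vm_dir, "halt_attempted_poll")
    successful = read_counter(vm_dir, "halt_successful_poll")
    halt_exits = read_counter(vm_dir, "halt_exits")
    if attempted:
        rate = 100.0 * successful / attempted
        print(f"{vm_dir.name}: {halt_exits} HLT exits, "
              f"{successful}/{attempted} polls succeeded ({rate:.1f}%)")
```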
Table 2. Steam* Games Performance Impact with Optimal Guest Halt Polling Values in Borealis VM
Summary
In this paper, we outlined the performance gap between VMs and native execution, along with the overhead caused by VM exits. In addition, we conducted experiments in guest halt polling optimization and measured its performance impact on workloads running in a VM on ChromeOS. Our implementation achieved a 4% to 71% performance improvement for Borealis VM workloads and an average 6% reduction in VM exits due to HLT using the guest halt polling technique. These techniques can also be applied in other VM environments on current and future generations of platforms for performance gains and power savings. The technique reduces CPU use and power consumption while ensuring that the VM can quickly resume work when needed, leading to better overall performance in virtualized environments.
Notices and Disclaimers
Tests document performance of components on a particular test in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit Performance Index.
Test Configuration
Platform: Brya Chromebook
Software: ChromeOS CPFE R123-15768.0.0
Hardware: Chromebooks with 13th-generation Intel Core i7 processors, 2P + 8E CPU configuration, 16 GB RAM
Features and benefits of technologies from Intel depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at the Intel website.