|
In this section we describe the performance tuning exercise done to date for VT-x guests. The classic approach is to run
a synthetic workload inside an HVM domain and compare the performance against the same workload running inside an
identically configured paravirtualized domain. To understand why an HVM domain behaves the way it does, however, we had
to extend tools such as Xentrace and Xenoprof to support HVM domains as well.
Xentrace is a tool for tracing events in the hypervisor. It can count occurrences of key events and measure their
handling time. We extended this tool to trace VT-x specific information such as VM exits, recording the exit cause and
the handling time of each exit.
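As an illustration of the kind of summary such a trace makes possible, the following minimal sketch groups VM exit records by cause and reports the count and mean handling time for each. The record layout (exit reason, entry timestamp, resume timestamp), the reason names, and the sample values are assumptions made for this example; they are not the actual Xentrace record format or measured data.

```python
from collections import defaultdict

# Hypothetical trace records: (exit_reason, entry_cycles, resume_cycles).
# The layout and values are illustrative only, not real Xentrace output.
SAMPLE_TRACE = [
    ("IO_INSTRUCTION", 1000, 9000),
    ("PAGE_FAULT", 9500, 11500),
    ("IO_INSTRUCTION", 12000, 22000),
    ("CR_ACCESS", 23000, 23500),
    ("PAGE_FAULT", 24000, 27000),
]


def summarize_exits(trace):
    """Return {reason: (count, total_cycles, mean_cycles)} for a trace."""
    counts = defaultdict(int)
    cycles = defaultdict(int)
    for reason, start, end in trace:
        counts[reason] += 1
        cycles[reason] += end - start
    return {r: (counts[r], cycles[r], cycles[r] / counts[r]) for r in counts}


def print_report(summary):
    """Print one line per exit cause, most expensive cause first."""
    ordered = sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True)
    for reason, (count, total, mean) in ordered:
        print(f"{reason:16s} count={count:3d} total={total:7d} mean={mean:9.1f}")


if __name__ == "__main__":
    print_report(summarize_exits(SAMPLE_TRACE))
```

Sorting by total handling time rather than by count puts exits that are rare but expensive at the top of the report, which is the view needed when deciding what to tune first.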
Xenoprof is a port of OProfile to the Xen environment. It uses hardware performance counters to track events such as
clock cycles, retired instructions, TLB misses, and cache misses. Each time a counter fires, Xenoprof samples the
program counter, so that a profile of the program's hotspots can be built. The original Xenoprof supports
paravirtualized guests only. We extended this tool to support HVM domains.
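The sampling idea can be sketched as follows: each sampled program counter is mapped to the function that contains it, and the per-function sample counts form the profile. The symbol table, address ranges, and function names below are hypothetical and chosen only for illustration; this is not how Xenoprof itself stores or resolves samples.

```python
import bisect
from collections import Counter

# Hypothetical symbol table, sorted by start address: (start_address, name).
SYMBOLS = [
    (0x1000, "exit_dispatch"),
    (0x2000, "shadow_fault_handler"),
    (0x3000, "io_emulate"),
]
TEXT_END = 0x4000  # first address past the last symbol (assumed)


def symbol_for(pc, symbols=SYMBOLS, text_end=TEXT_END):
    """Map a sampled program counter to the symbol that contains it."""
    starts = [start for start, _name in symbols]
    index = bisect.bisect_right(starts, pc) - 1
    if index < 0 or pc >= text_end:
        return "<unknown>"
    return symbols[index][1]


def build_profile(samples):
    """Count samples per symbol; a higher count means a hotter function."""
    return Counter(symbol_for(pc) for pc in samples)


if __name__ == "__main__":
    sampled_pcs = [0x1010, 0x2004, 0x2100, 0x2FF0, 0x3008, 0x2040, 0x0500]
    for name, hits in build_profile(sampled_pcs).most_common():
        print(f"{name:22s} {hits}")
```

Because samples are taken when a counter fires, the share of samples in a function approximates its share of the counted event, whether that event is cycles, TLB misses, or cache misses.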
A typical tuning experiment proceeds as follows:
-
Run a workload and use Xentrace to track the VM exit events occurring during the run.
-
Run a workload and use Xenoprof to profile the hotspots in the hypervisor.
We observed that the bulk of the exits are caused by I/O instructions or shadow page table operations. I/O instructions
have the longest handling time because they require a context switch to Dom0. At one stage of our tuning experiment, 40%
of the hypervisor time was spent in the shadow code.
Based on the above findings, we focused on tuning the I/O handler code and improving the shadow page handling.
-
From the Xentrace results, we observed that the majority of the guest I/O accesses are to the PIC ports. This is
because the guest timer handler needs regular access to the PIC ports. By moving the PIC model into the hypervisor, we
dramatically reduced the PIC handling time. Kernel build performance improved by 14% and the CPU2k benchmark improved
by 7%.
-
The original QEMU IDE model handles IDE DMA operations synchronously. When a guest starts an IDE DMA operation, the
QEMU model waits for the host to complete the DMA request. We added a new thread to handle DMA operations
asynchronously. This change increased guest kernel build performance by 8%.
-
The original QEMU NIC model is implemented using a polling loop. We changed the code to an event-driven design that
waits on the packet file descriptors. This change improved SCP performance by 10 to 40 times.
-
The original QEMU VGA model emulated a graphics card. When the guest updates the screen, each video memory write
causes a VM exit, and the pixel data has to be forwarded to a VGA model in Dom0. To speed up graphics performance, we
implemented a shared memory area between the QEMU model and the HVM guest. Guest video memory writes no longer cause
a VM exit. Instead, the VGA model updates the screen periodically using the data in the shared memory area. This
improved XWindow performance dramatically, by 5 to 1000 times.
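The change to the NIC model described in the list above, from a polling loop to waiting on packet file descriptors, can be illustrated with a minimal sketch. It is not the QEMU code: a local socket pair stands in for the packet file descriptor, and the function names, buffer size, and timeouts are assumptions made for this example.

```python
import selectors
import socket


def wait_for_packets(selector, timeout):
    """Sleep until a registered descriptor is readable, then read from it.

    Unlike a polling loop, the caller consumes no CPU while no packet is
    pending; the operating system wakes it when data arrives.
    """
    packets = []
    for key, _events in selector.select(timeout):
        packets.append(key.fileobj.recv(2048))
    return packets


def demo():
    # A socket pair stands in for the packet file descriptor (assumption).
    backend, model = socket.socketpair()
    selector = selectors.DefaultSelector()
    selector.register(model, selectors.EVENT_READ)
    try:
        # No traffic yet: the call sleeps for the timeout and returns nothing.
        idle = wait_for_packets(selector, timeout=0.05)
        # Once a frame is sent, the descriptor becomes readable.
        backend.send(b"frame-1")
        ready = wait_for_packets(selector, timeout=1.0)
        return idle, ready
    finally:
        selector.close()
        backend.close()
        model.close()


if __name__ == "__main__":
    print(demo())
```

The same pattern extends to several descriptors: each is registered once, and a single blocking call returns only the ones that have data waiting.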
|