Xen* 3.0 architecture (Figure 1) has a small hypervisor kernel that deals with virtualizing the CPU, memory, and critical
I/O resources, such as the interrupt controller. Dom0 is a paravirtualized Linux* that has privileged access to all I/O
devices in the platform and is an integral part of any Xen-based system. Xen 3.0 also includes a control panel that
controls the sharing of the processor, memory, network, and block devices. Access to the control interface is limited to
Dom0. Multiple user domains, called DomainU (DomU), can be created to run paravirtualized guest OSs. Dom0 and DomU OSs
use hypercalls to request services from the Xen hypervisor.
When Intel® VT is used, fully virtualized domains can be created to run unmodified guest OSs. These fully virtualized
domains are given the special name of HVMs (hardware-based virtual machines). Xen presents to each HVM guest a
virtualized platform that resembles a classic PC/server platform with a keyboard, mouse, graphics display, disk, floppy,
CD-ROM, etc. This virtualized platform support is provided by the Virtual I/O Devices module.
In the following sections we describe the extensions to each of these Xen components.

Figure 1: Xen 3.0 architecture
Control Panel
We have extended the control panel to support creating, controlling, and destroying HVM domains. The user can specify
configuration parameters such as the guest memory map and size, the virtualized disk location, network configuration,
etc.
The control panel loads the guest firmware into the HVM domain and creates the device model thread (explained later)
that will run in Dom0 to service input/output (I/O) requests from the HVM guest. The control panel also configures the
virtual devices seen by the HVM guest, such as the interrupt binding and the PCI configuration.
The HVM guest is then started, and control is passed to the first instruction in the guest firmware. The HVM guest
executes at native speed until it encounters an event that requires special handling by Xen.
Guest Firmware
The guest firmware (BIOS) provides the boot services and run-time services required by the OS in the HVM. This guest
firmware does not see any real physical devices. It operates on the virtual devices provided by the device models.
For VT-x, we are re-using the open source Bochs BIOS [5]. We extended the Bochs BIOS by adding Multi-Processor
Specification (MPS) tables [6] and Advanced Configuration and Power Interface (ACPI) tables [7], including the Multiple
APIC Description Table (MADT). The BIOS and the early OS loader expect to run in real mode. To create the environment
needed by this code, we configure the VT-x guest to execute in virtual-8086 mode. Instructions that cannot be executed
in this mode are intercepted and emulated with a software emulator.
For VT-i, we developed a guest firmware using the Intel® Platform Innovation Framework for Extensible Firmware Interface
(EFI). This guest firmware provides all EFI boot services required by IPF guest OSs. It is compatible with the
Developer's Interface Guide for 64-bit Intel® Architecture-based Servers (DIG64) and provides the System Abstraction
Layer (SAL), ACPI 2.0, and EFI 1.10 tables required by IPF guest OSs.
Processor Virtualization
The Virtual CPU module in Xen provides the abstraction of a processor to the HVM guest. It manages the virtual
processor(s) and associated virtualization events when the guest OS is executing. It saves the physical processor state
when the guest gives up a physical CPU, and restores the guest state when it is rescheduled to run on a physical
processor.
For the IA-32 architecture, a VMCS structure is created for each CPU in an HVM domain (Figure 2). The execution control
of the CPU in VMX mode is configured as follows:
-
Instructions such as CPUID, MOV from/to CR3, MOV to CR0/CR4, RDMSR, WRMSR, HLT, INVLPG, MOV from CR8, MOV DR, and
MWAIT are intercepted as VM exits.
-
Exceptions/faults, such as page fault, are intercepted as VM exits, and virtualized exceptions/faults are injected
on VM entry to guests.
-
External interrupts unrelated to guests are intercepted as VM exits, and virtualized interrupts are injected on VM
entry to the guests.
-
Read shadows are created for the guest CR0, CR4, and time stamp counter (TSC). Read accesses to such registers will
not cause a VM exit, but will return the shadow values.

Figure 2: VMCS
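The read-shadow behavior for CR0 can be illustrated with a small sketch. It assumes the VT-x guest/host mask mechanism, in which bits owned by the hypervisor are read from the shadow and all other bits come from the real register. The function and variable names are hypothetical and are not taken from the Xen source, and the TSC is not modeled.

```python
# Sketch of CR0 read-shadow semantics under VT-x (illustrative names only).
CR0_PE = 1 << 0    # protection enable
CR0_TS = 1 << 3    # task switched
CR0_PG = 1 << 31   # paging


def guest_visible_cr0(real_cr0: int, host_mask: int, read_shadow: int) -> int:
    """Return the value a guest read of CR0 observes without a VM exit.

    Bits set in host_mask are owned by the hypervisor and are taken from
    the read shadow; the remaining bits are taken from the real register.
    """
    return (real_cr0 & ~host_mask) | (read_shadow & host_mask)


# Assumed scenario: the hardware runs with PE and PG set, while the guest
# firmware believes both are clear because it expects real mode.
real = CR0_PE | CR0_PG | CR0_TS
mask = CR0_PE | CR0_PG
shadow = 0
seen = guest_visible_cr0(real, mask, shadow)
```

In this scenario the guest sees only the TS bit, which it owns, while the PE and PG bits that the hypervisor needs in hardware stay hidden.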
For the Itanium® architecture, a Virtual Processor Descriptor (VPD) structure is created for each CPU in an HVM domain. The VPD
has functionality similar to that of the VMCS in the IA-32 architecture. The virtualization control of the CPU is configured as
follows:
-
Instructions such as MOV from/to RR, MOV from/to CR, ITC/PTC, ITR/PTR, MOV from/to PKR, MOV from/to IBR/DBR are
intercepted as virtualization faults.
-
Instructions such as COVER, BSW are optimized to execute without virtualization faults.
-
Exceptions/faults are intercepted by the VMM, and virtualized exceptions/faults are injected to the guest on a VM
resume.
-
External interrupts are intercepted by the VMM, and virtualized external interrupts are injected to the guest using
the virtual external interrupt optimization.
-
Read shadows are created for the guest interruption control registers, PSR, and CPUID. Read accesses to such registers
will not cause a virtualization fault, but will return the shadow values.
-
Write shadows are created for the guest interruption control registers. Write accesses to such registers will not
cause a virtualization fault, but will write to the shadow values.
An interesting question when designing Xen concerns the processor features that are exposed to HVM guests. Some VMMs
present only a generic, minimally featured processor to the guest. This allows the guest to migrate easily to arbitrary
platforms, but precludes the guest from using new instructions or processor features that may exist in the processor.
For Xen, we are exporting most CPUID bits to the guest. We clearly need to clear the VMX bit [Leaf 1, ECX:5], or else
the guest may bring up another level of virtualization. Other bits to be cleared include machine check architecture
(MCA), because MCA issues are handled by the hypervisor. Today's OSs also use model-specific registers to detect the
microcode version on the processor and to decide whether they need to perform a microcode update. For Xen, we decided to
fake the update request, i.e., bump the microcode version number without changing the microcode itself.
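As a minimal sketch of the CPUID filtering described above, the following function clears the VMX bit (Leaf 1, ECX bit 5) and the MCA bit (Leaf 1, EDX bit 14) and passes every other feature bit through unchanged. The function name is hypothetical, and the full set of bits that Xen clears is not reproduced here.

```python
# Feature bits reported by CPUID leaf 1.
CPUID1_ECX_VMX = 1 << 5     # ECX bit 5: VMX
CPUID1_EDX_MCA = 1 << 14    # EDX bit 14: machine check architecture


def filter_cpuid_leaf1(ecx: int, edx: int) -> tuple:
    """Return the (ECX, EDX) feature words exposed to an HVM guest.

    Only VMX and MCA are hidden in this sketch; all other bits are
    exported to the guest as reported by the physical processor.
    """
    return ecx & ~CPUID1_ECX_VMX, edx & ~CPUID1_EDX_MCA


guest_ecx, guest_edx = filter_cpuid_leaf1(0xFFFFFFFF, 0xFFFFFFFF)
```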
Memory Virtualization
The virtual Memory Management Unit (MMU) module in the Xen hypervisor presents the abstraction of a hardware MMU to the
HVM domain. HVM guests see guest physical addresses (GPAs), and this module translates GPAs to the appropriate machine
physical addresses (MPAs).
IA-32 Memory Virtualization
The virtual MMU module supports all page table formats that can be used by the guest OS.
-
For IA-32
-
it supports 2-level page tables with 4 KB page size for 32-bit guests.
-
For IA-32 Physical Address Extension (PAE)
-
it supports 2-level page tables with 4 KB page sizes for 32-bit guests.
-
it supports 3-level page tables with 4 KB and 2 MB page sizes for 32-bit PAE guests.
-
For Intel® EM64T
-
it supports 2-level page tables with 4 KB page size for 32-bit guests.
-
it supports 3-level page tables with 4 KB and 2 MB page sizes for 32-bit PAE guests.
-
it supports 4-level page tables with 4 KB and 2 MB page sizes for 64-bit guests.
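To make the page table formats concrete, the sketch below splits a virtual address into its per-level table indices and page offset for 4 KB pages. The bit widths follow the architectural layouts (10/10/12 for 2-level, 2/9/9/12 for 3-level PAE, and 9/9/9/9/12 for 4-level); the dictionary and function names are hypothetical. For 2 MB pages the walk would stop one level early, which is not modeled here.

```python
# (virtual address width, index bits per level from the top, offset bits)
LAYOUTS = {
    "2-level": (32, [10, 10], 12),
    "3-level PAE": (32, [2, 9, 9], 12),
    "4-level": (48, [9, 9, 9, 9], 12),
}


def split_va(va: int, layout: str):
    """Split a virtual address into (table indices, page offset)."""
    width, index_bits, offset_bits = LAYOUTS[layout]
    va &= (1 << width) - 1
    offset = va & ((1 << offset_bits) - 1)
    shift = offset_bits
    indices = []
    # Walk from the lowest level upward, then reverse to top-down order.
    for bits in reversed(index_bits):
        indices.append((va >> shift) & ((1 << bits) - 1))
        shift += bits
    return list(reversed(indices)), offset


two_level = split_va(0xC0101234, "2-level")
pae = split_va(0xC0101234, "3-level PAE")
```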
For the IA-32 architecture, this module maintains a shadow page table for the guest (Figure 3). This is the actual page
table used by the processor during VMX operation, containing page table entries (PTEs) with machine page-frame numbers.
Every time the guest modifies its page mapping, either by changing the content of a translation, creating a new
translation, or removing an existing translation, the virtual MMU module will capture the modification and adjust the
shadow page tables accordingly. Since Xen already has shadow page table code for paravirtualized guests, we extended the
code to support fully virtualized guests. The resultant code handles paravirtualized and unmodified guests in a
unified fashion.
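The relationship between a guest PTE and its shadow PTE can be sketched as follows. This is a simplified illustration, assuming 32-bit PTEs with 4 KB pages and a dictionary that maps guest page-frame numbers to machine page-frame numbers; the names `shadow_pte` and `p2m` are hypothetical.

```python
PAGE_SHIFT = 12
FLAGS_MASK = (1 << PAGE_SHIFT) - 1
PTE_PRESENT = 1 << 0


def shadow_pte(guest_pte: int, p2m: dict) -> int:
    """Build the shadow PTE for one guest PTE.

    The guest PTE holds a guest page-frame number; the shadow PTE holds
    the machine page-frame number and keeps the guest's flag bits.
    A zero result means "not present", so the access faults to the VMM.
    """
    if not guest_pte & PTE_PRESENT:
        return 0
    gfn = guest_pte >> PAGE_SHIFT
    mfn = p2m.get(gfn)
    if mfn is None:
        return 0    # frame is not contained in the domain
    return (mfn << PAGE_SHIFT) | (guest_pte & FLAGS_MASK)


p2m = {0x10: 0x9A3}
entry = shadow_pte((0x10 << PAGE_SHIFT) | 0x7, p2m)
```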

Figure 3: Shadow page table
The shadow page table code is the most critical code for overall performance. The most
rudimentary implementation constructs the shadow page tables from scratch every time the guest updates CR3
to request a TLB flush. This, however, incurs significant overhead. If we can tell which guest page table entries
have been modified, we just need to clean up the affected shadow entries, allowing the existing shadow page tables to be
reused.
The following algorithm is used to optimize shadow page table management:
-
When allocating a shadow page upon page fault from the guest, write protect the corresponding guest page table page.
This allows you to detect any attempt to modify the guest page table. For this to work, you need to find all
translations that map the guest page table page. There are several optimizations for this as discussed below.
-
Upon page fault against a guest page table page, save a "snapshot" of the page and give write permission to the
page. The page is then added to an "out of sync" list with the information on such an attempt (i.e., which address,
etc.). Now the guest can continue to update the page.
-
When the guest executes an operation that results in a TLB flush, reflect all the entries on the "out of sync"
list to the shadow page table. By comparing the snapshot and the current page in the guest page table, you can
update only the shadow entries that changed, checking that the page frame numbers in the guest page tables are valid (i.e.,
contained in the domain).
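The three steps above can be sketched as a small model. This is an illustration only, not the Xen implementation: guest page table pages are plain Python lists, write protection is a set of page-frame numbers, and all class and method names are hypothetical.

```python
PAGE_SHIFT = 12
FLAGS_MASK = (1 << PAGE_SHIFT) - 1


class ShadowMMU:
    def __init__(self, p2m):
        self.p2m = p2m                # guest frame -> machine frame
        self.guest_tables = {}        # gfn of a page table page -> guest PTEs
        self.shadow = {}              # gfn of a page table page -> shadow PTEs
        self.write_protected = set()
        self.out_of_sync = {}         # gfn -> snapshot taken at first write

    def _translate(self, pte):
        mfn = self.p2m.get(pte >> PAGE_SHIFT)
        if not pte & 1 or mfn is None:
            return 0                  # not present or not in the domain
        return (mfn << PAGE_SHIFT) | (pte & FLAGS_MASK)

    def shadow_page(self, gfn):
        """Step 1: allocate a shadow page and write protect the guest page."""
        self.shadow[gfn] = [self._translate(p) for p in self.guest_tables[gfn]]
        self.write_protected.add(gfn)

    def guest_write(self, gfn, index, pte):
        """Step 2: the first write faults, is snapshotted, then proceeds."""
        if gfn in self.write_protected:
            self.out_of_sync[gfn] = list(self.guest_tables[gfn])
            self.write_protected.discard(gfn)
        self.guest_tables[gfn][index] = pte

    def on_tlb_flush(self):
        """Step 3: resync only the entries that differ from the snapshot."""
        updated = 0
        for gfn, snapshot in self.out_of_sync.items():
            current = self.guest_tables[gfn]
            for i, (old, new) in enumerate(zip(snapshot, current)):
                if old != new:
                    self.shadow[gfn][i] = self._translate(new)
                    updated += 1
            self.write_protected.add(gfn)   # protect the page again
        self.out_of_sync.clear()
        return updated
```

Between the first write and the next TLB flush, the guest can update the page freely without further faults; the shadow entries are stale during that window, which is safe because the guest itself must flush the TLB before relying on the new mappings.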
Itanium Processor Architecture Memory Virtualization

Figure 4: IPF TLB virtualization
The Itanium processor architecture defines Translation Register (TR) entries that can be used to statically map a range
of virtual addresses to physical addresses. Translation Cache (TC) entries are used for dynamic mappings. Address
translation entries can reside in either the TLB or in a Virtual Hash Page Table (VHPT). On a TLB miss, a hardware
engine will walk the VHPT to extract the translation entry for the referenced address and insert the translation into
the TLB.
Figure 4 illustrates the TLB virtualization logic in Xen. We extended the Xen hypervisor to capture all TLB insertions
and deletions initiated by a guest OS. This information is used to maintain the address translation for the guest. Two
new data structures are added to Xen:
-
The Machine VHPT is a per virtual CPU data structure. It is maintained by the hypervisor and tracks the translations
for guest TR and TC entries mapping normal memory. It is walked by the hardware VHPT walker on a TLB miss.
The Itanium processor architecture defines two formats for the VHPT. The short-format VHPT is meant to be used by an OS
to implement linear page tables. The long-format VHPT has a larger footprint but supports protection keys and collision
chains. We have extended the Xen hypervisor to use the long-format VHPT.
-
The guest software TLB structure is used to track guest TRs and TCs mapping memory-mapped I/O addresses or less than
preferred page table entries. Accesses to these addresses must be intercepted and forwarded to the device model.
The Region Identifier (RID) is an important component of the Itanium architecture virtual memory management system. It is
used to uniquely identify a region of the virtual address space. Per Itanium architecture specifications, a RID should have at least
18 bits and at most 24 bits. The exact number of RID bits implemented by a processor can be found by using the
PAL_VM_SUMMARY call. An address lookup will require matching the RID as well as the virtual address.
Each IPF guest OS thinks it has unique ownership of the RIDs. If you allow two VT-i domains to run on the same processor
with the same RID, you need to flush the machine TLB whenever a domain is switched out. This will have a significant
negative impact on system performance.
The solution we used for Xen is to partition the RIDs between the domains. Specifically, we reserved several high-order
bits from the RID as the guest identifier. The machine RID used for the guest is then a concatenation of the guest ID
and the RID managed by the guest itself.
machine_rid = guest_rid + (guest_id << 18)
As an illustration, if we have a CPU that supports a 24-bit RID, the guest firmware inside the VT-i guest will report
only an 18-bit RID to the guest. The actual 24-bit RID installed into the machine will have the guest identifier in the
upper 6 bits.
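The RID partitioning can be written as a short function. This sketch assumes the 24-bit machine RID and 18-bit guest RID from the illustration above; the function name and the range checks are additions for clarity and are not part of the original design description.

```python
GUEST_RID_BITS = 18   # RID width reported to the guest by the guest firmware


def machine_rid(guest_id: int, guest_rid: int, machine_rid_bits: int = 24) -> int:
    """Concatenate the guest identifier and the guest-managed RID."""
    guest_id_bits = machine_rid_bits - GUEST_RID_BITS
    if not 0 <= guest_rid < (1 << GUEST_RID_BITS):
        raise ValueError("guest RID does not fit in 18 bits")
    if not 0 <= guest_id < (1 << guest_id_bits):
        raise ValueError("guest identifier does not fit in the reserved bits")
    return guest_rid + (guest_id << GUEST_RID_BITS)


rid = machine_rid(guest_id=3, guest_rid=0x2A)
```

Because two guests never share a guest identifier, the same guest RID in two domains yields two different machine RIDs, so the machine TLB does not have to be flushed on a domain switch.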
We also need two more RIDs per domain for guest physical mode emulation. The guest physical mode accesses are emulated
by using a virtual address with special RIDs. This restricts the total number of IPF guests to 63.
This is a reasonable solution when the number of concurrent guests is limited and the guests are not running millions of
processes concurrently. A more elaborate scheme is needed if this assumption is not true.
Device Virtualization
Figure 5 illustrates the device virtualization logic in Xen. The Virtual I/O devices (device models) in Dom0 provide the
abstraction of a PC platform to the HVM domain. Each HVM domain sees an abstraction of a PC platform with a keyboard,
mouse, real-time clock, 8259 programmable interrupt controller, 8254 programmable interval timer, CMOS, IDE disk,
floppy, CDROM, and VGA/graphics.
To reduce the development effort, we reuse the device emulation module from the open source QEMU project [8]. Our basic
design is to run an instance of the device models in Dom0 per HVM domain. Performance-critical models, like the
Programmable Interval Timer (PIT) and the Programmable Interrupt Controller (PIC), are moved into the hypervisor.

Figure 5: I/O Device virtualization
The primary function of the device model is to wait for an I/O event from the HVM guest and dispatch it to the
appropriate device emulation model. Once the device emulation model completes the I/O request, it will respond back with
the result. A shared memory region between the device model and the Xen hypervisor is used to communicate I/O requests and
responses.
The device model utilizes Xen's event channel mechanism and waits for events coming from the HVM domain via an event
channel, with appropriate timeouts to support the internal timer mechanisms within these emulators.
I/O Port Accesses
We set up the I/O bitmap to intercept I/O port accesses by the guest. At each such VM exit, we collect exit
qualification information such as port number, access size, direction, string or not, REP prefixed or not, etc. This
information is packaged as an I/O request packet and sent to the device model in Dom0.
The following is an example of how an I/O request from an HVM guest is handled:
-
VM exit due to an I/O access.
-
Decode the instruction.
-
Make an I/O request packet (ioreq_t) describing the event.
-
Send the event to the device model in Dom0.
-
Wait for the response to the I/O port or MMIO operation from the device model.
-
Unblock the HVM domain.
-
VMRESUME back to the guest OS.
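The request and response path above can be sketched as follows. The source names the packet type `ioreq_t` but does not list its fields, so the fields of `IOReq`, the `DeviceModel` class, and the CMOS-like device at ports 0x70/0x71 are all assumptions made for illustration. Returning all ones for a read from an unclaimed port is also an assumption, based on conventional PC behavior.

```python
from dataclasses import dataclass

IO_READ, IO_WRITE = 0, 1


@dataclass
class IOReq:
    """Hypothetical I/O request packet built from the exit qualification."""
    port: int
    size: int         # access size in bytes
    direction: int    # IO_READ or IO_WRITE
    data: int = 0     # value written, or value returned for a read


class DeviceModel:
    """Dispatches each request to the emulation model that owns the port."""

    def __init__(self):
        self.handlers = {}

    def register(self, port, handler):
        self.handlers[port] = handler

    def handle(self, req):
        handler = self.handlers.get(req.port)
        if handler is None:
            if req.direction == IO_READ:
                req.data = (1 << (8 * req.size)) - 1   # unclaimed port
            return req
        return handler(req)


class Cmos:
    """Index/data register pair, in the style of PC CMOS at 0x70/0x71."""

    def __init__(self):
        self.index = 0
        self.ram = [0] * 128

    def io(self, req):
        if req.port == 0x70 and req.direction == IO_WRITE:
            self.index = req.data & 0x7F
        elif req.port == 0x71 and req.direction == IO_WRITE:
            self.ram[self.index] = req.data & 0xFF
        elif req.port == 0x71:
            req.data = self.ram[self.index]
        return req


dm = DeviceModel()
cmos = Cmos()
dm.register(0x70, cmos.io)
dm.register(0x71, cmos.io)
```

In the real system the request travels through shared memory and an event channel, and the HVM domain stays blocked until the response arrives; here the call to `handle` stands in for that round trip.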
Although this design significantly reduced our development efforts, almost all I/O operations require domain switches to
Dom0 to run the device model, resulting in high CPU overhead and I/O latencies. To give HVM domains better I/O
performance, we also ported Xen's Virtual Block Device (VBD) and Virtual Network Interface (VNIF) to HVM domains.
Memory-Mapped I/O Handling
Most devices require memory-mapped I/O to access the device registers. Critical interrupt controllers, such as I/O APIC,
also require memory-mapped I/O access. We intercept these MMIO accesses as page faults.
On each VM exit due to a page fault, you need to do the following:
-
Check the PTE to see if the guest page-frame belongs to the MMIO range.
-
If so, decode the instruction and send an I/O request packet to the device model in Dom0.
-
Otherwise, hand the event to the shadow page code for handling.
The Itanium processor family supports memory-mapped I/O only. It implements the above logic in the page fault handler.
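A minimal sketch of the page fault decision is shown below. It assumes that the guest PTE for the faulting address has already been looked up and that MMIO regions are kept as a list of address ranges; the range shown is the conventional I/O APIC base and is used here only as an example. The function names and return values are hypothetical.

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

# Example guest-physical MMIO range (conventional I/O APIC location).
MMIO_RANGES = [(0xFEC00000, 0xFEC01000)]


def is_mmio(gpa: int, ranges) -> bool:
    return any(start <= gpa < end for start, end in ranges)


def handle_page_fault(guest_pte: int, fault_va: int, ranges=MMIO_RANGES):
    """Route a page fault to the device model or to the shadow page code."""
    # Guest physical address = guest frame from the PTE + offset in the page.
    gpa = ((guest_pte >> PAGE_SHIFT) << PAGE_SHIFT) | (fault_va & PAGE_MASK)
    if is_mmio(gpa, ranges):
        return ("device_model", gpa)   # decode and send an I/O request
    return ("shadow", gpa)             # ordinary shadow page table fault


target = handle_page_fault(guest_pte=0xFEC00003, fault_va=0xFFFE0010)
```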
Interrupt Handling
The real local APICs and I/O APICs are owned and controlled by the Xen hypervisor. All external interrupts will cause VM
exits. Interrupts owned by the hypervisor (e.g., the local APIC timer) are handled inside the hypervisor. All other
interrupts are passed to the handler in Dom0. This way the HVM domain does not handle real
external interrupts.
The HVM guests only see virtualized external interrupts. The device models can trigger a virtual external interrupt by
sending an event to the interrupt controller (PIC or APIC) device model. The interrupt controller device model then
injects a virtual external interrupt to the HVM guest on the next VM entry.
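The injection path can be modeled with a small sketch. It assumes a single 8259-style controller in which a lower IRQ number has higher priority, and a vector base that the guest has already programmed; masking, end-of-interrupt handling, and the APIC are not modeled. All names are hypothetical.

```python
class VirtualPic:
    """Tracks pending virtual interrupts for one HVM guest."""

    def __init__(self, vector_base: int = 0x20):
        self.pending = 0              # one bit per IRQ line
        self.vector_base = vector_base

    def raise_irq(self, irq: int) -> None:
        """Called when a device model signals an event on an IRQ line."""
        self.pending |= 1 << irq

    def next_vector(self):
        """Pop the highest-priority pending IRQ and return its vector."""
        if not self.pending:
            return None
        lowest_bit = self.pending & -self.pending
        irq = lowest_bit.bit_length() - 1
        self.pending &= ~lowest_bit
        return self.vector_base + irq


def on_vm_entry(pic: VirtualPic, guest_interrupts_enabled: bool):
    """Return the vector to inject on this VM entry, or None."""
    if not guest_interrupts_enabled:
        return None                   # keep the interrupt pending
    return pic.next_vector()


pic = VirtualPic()
pic.raise_irq(4)
pic.raise_irq(1)
```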
Virtual Device Drivers
The VBD and VNIF are based on a split driver pair where the front-end driver runs inside a guest domain while the
backend driver runs inside Dom0 or an I/O VM. To port these drivers to HVM domains, we have to solve two major
challenges:
-
Define a way to allow the hypervisor to access data inside the guest, based on a guest virtual address.
We solved this problem by defining a copy_from_guest() hypercall that will walk the guest's page table and map the
resulting physical pages into the hypervisor address space.
-
Define a way to signal Xen events to the virtual drivers. This must be done in a way that is consistent with the
guest OS's device driver infrastructure.
We solved this problem by implementing the driver as a fake PCI device driver with its own interrupt vector. This vector
is communicated to the hypervisor via a hypercall. Subsequently, the hypervisor will use this vector to signal an event
to the virtual device driver.
The send performance of the VNIF ported this way approximates that of the VNIF running in paravirtualized DomU. The
receive throughput is lower. We are continuing our investigation.