|
Established and emerging uses provide strong motivation for improving virtualization support in both server
and client computing systems. Unfortunately, the IA-32 and Itanium® architectures present many challenges to
providing such support. Software techniques exist that address some of those challenges.
Challenges to virtualizing Intel® Architectures
Intel microprocessors (both IA-32 and Itanium® architecture) provide protection based on the concept of a 2-bit
privilege level, using 0 for most-privileged software and 3 for least-privileged. The privilege level
determines whether privileged instructions, which control basic CPU functionality, can execute without fault.
It also controls address-space accessibility based on the configuration of the processor's page tables and,
for IA-32, segment registers. Most IA software uses only privilege levels 0 and 3.
For an OS to control the CPU, some of its components must run with privilege level 0. Because a VMM cannot
allow a guest OS such control, a guest OS cannot execute at privilege level 0. Thus, VMMs running on either
IA-32 or Itanium processors must use ring deprivileging, a technique that runs all guest software at a
privilege level greater than 0. A guest OS could be deprivileged in two distinct ways: it could run either at
privilege level 1 (the 0/1/3 model) or at privilege level 3 (the 0/3/3 model).
Although the 0/1/3 model supports simpler VMMs, it cannot be used for guests on IA-32 processors in 64-bit
mode (see the "Ring compression" section). (64-bit mode is part of Intel® Extended Memory 64
Technology, or Intel® EM64T, the 64-bit extensions to IA-32.)
Ring aliasing
Ring aliasing refers to problems that arise when software is run at a privilege level other than the
privilege level for which it was written.
An example in IA-32 involves the CS segment register, which points to the code segment. If the PUSH
instruction is executed with the CS segment register, the contents of that register (which include the
current privilege level) are pushed on the stack. Similarly, the Itanium instruction br.call saves the current
privilege level into the ppl field of the Previous Function State (PFS) register, which can be read at any
privilege level. In either case, a guest OS could easily determine that it is not running at privilege level
0.
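As a small illustration of the IA-32 case, the sketch below decodes the low two bits of a CS selector value, which hold the current privilege level. The selector values and function names are hypothetical and chosen only for this example.

```python
def privilege_level_from_cs(selector: int) -> int:
    """Return the privilege level held in bits 1:0 of a CS selector value."""
    return selector & 0b11


# Hypothetical selector values: same descriptor index (1), different level.
CS_AT_RING_0 = 0x08  # index 1, table indicator 0, privilege level 0
CS_AT_RING_1 = 0x09  # index 1, table indicator 0, privilege level 1


def guest_detects_deprivileging(pushed_cs: int) -> bool:
    """A guest OS written for ring 0 sees a nonzero level in the pushed CS."""
    return privilege_level_from_cs(pushed_cs) != 0
```

A guest OS that pushes CS and inspects the result in this way would observe level 1 under the 0/1/3 model and level 3 under the 0/3/3 model.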
Address-space compression
OSs expect to have access to the processor's full virtual-address space (known as the linear-address space in
IA-32). A VMM must reserve for itself some portion of the guest's virtual-address space. It could run
entirely within the guest's virtual-address space, which allows it easy access to guest data, but the VMM's
instructions and data structures would use a substantial amount of the guest's virtual-address space.
Alternatively, the VMM can run in a separate address space, but even in that case, the VMM must use a minimal
amount of the guest's virtual-address space for the control structures that manage transitions between guest
software and the VMM. For IA-32, these structures include the interrupt-descriptor table (IDT) and the
global-descriptor table (GDT), which reside in the linear-address space. For the Itanium architecture, the
structures include the interruption vector table (IVT), which resides in the virtual-address space.
The VMM must prevent guest access to those portions of the guest's virtual-address space that the VMM is
using. Otherwise, the VMM's integrity could be compromised (if the guest can write to those portions) or the
guest could detect that it is running in a VM (if it can read those portions). Guest attempts to access these
portions of the address space must generate transitions to the VMM, which can emulate or otherwise support
them. The term address-space compression refers to the challenges of protecting these portions of the
virtual-address space and supporting guest accesses to them.
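The check a VMM needs can be sketched as a simple range-overlap test. This is a minimal model, not how any particular VMM implements protection (in practice the paging or segmentation hardware performs the check); the addresses and names below are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical VMM-reserved ranges as (start, end) pairs, end exclusive.
RESERVED: List[Tuple[int, int]] = [
    (0xFFC0_0000, 0xFFC0_1000),  # for example, a page holding the IDT
    (0xFFC0_1000, 0xFFC0_2000),  # for example, a page holding the GDT
]


def overlaps_reserved(addr: int, size: int,
                      reserved: List[Tuple[int, int]] = RESERVED) -> bool:
    """Return True if the access [addr, addr + size) touches a reserved range.

    Such an access must cause a transition to the VMM, which can then
    emulate or otherwise support it.
    """
    end = addr + size
    return any(addr < r_end and r_start < end for r_start, r_end in reserved)
```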
Non-faulting access to privileged state
Privilege-based protection prevents unprivileged software from accessing certain components of CPU state. In
most cases, attempted accesses result in faults, allowing a VMM to emulate the desired guest instruction.
However, the IA-32 and Itanium architectures both include instructions that access privileged state and do
not fault when executed with insufficient privilege. For example, the IA-32 registers GDTR, IDTR, LDTR, and
TR contain pointers to data structures that control CPU operation. Software can execute the instructions that
write to, or load, these registers (LGDT, LIDT, LLDT, and LTR) only at privilege level 0. However, software
can execute the instructions that read, or store, from these registers (SGDT, SIDT, SLDT, and STR) at any
privilege level. If the VMM maintains these registers with unexpected values, a guest OS using the latter
instructions could determine that it does not have full control of the CPU.
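The asymmetry between the two instruction groups can be summarized in a toy model. It only encodes the faulting behavior described above; the function names are ours and nothing here executes real instructions.

```python
LOAD_INSTRUCTIONS = {"LGDT", "LIDT", "LLDT", "LTR"}   # write the registers
STORE_INSTRUCTIONS = {"SGDT", "SIDT", "SLDT", "STR"}  # read the registers


def faults(instruction: str, privilege_level: int) -> bool:
    """Return True if the instruction faults at the given privilege level."""
    if instruction in LOAD_INSTRUCTIONS:
        return privilege_level != 0   # allowed only at privilege level 0
    if instruction in STORE_INSTRUCTIONS:
        return False                  # executes at any privilege level
    raise ValueError(f"instruction not covered by this model: {instruction}")


def vmm_can_intercept(instruction: str, guest_level: int) -> bool:
    """With ring deprivileging, the VMM sees only instructions that fault."""
    return faults(instruction, guest_level)
```

In this model, a deprivileged guest's LGDT reaches the VMM, while its SGDT silently returns whatever value the VMM placed in GDTR.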
Another example pertains to the page-table address (PTA) register of the Itanium architecture, a field that
references the base address of the virtual hash page table (VHPT). The instruction mov to cr.PTA is the
normal way to access this register, and software can execute it only at privilege level 0. However, the thash
instruction indirectly exposes all or part of the VHPT base address, and software can execute it at any
privilege level. If the VMM maintains the VHPT at a different address than the guest OS expects, a guest OS
using the thash instruction could determine that it does not have full control of the CPU.
Adverse impact on guest system calls
Ring deprivileging can interfere with the effectiveness of facilities in the IA-32 architecture that
accelerate the delivery and handling of transitions to OS software. The IA-32 SYSENTER and SYSEXIT
instructions support low-latency system calls. SYSENTER always effects a transition to privilege level 0, and
SYSEXIT faults if executed outside that ring. Ring deprivileging thus has the following implications:
- Executions of SYSENTER by a guest application cause transitions to the VMM and not to the guest OS.
  The VMM must emulate every guest execution of SYSENTER.
- Executions of SYSEXIT by a guest OS cause faults to the VMM. The VMM must emulate every guest
  execution of SYSEXIT.
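The two implications can be captured in a short model that counts how many times the VMM must step in for one system-call round trip. The function names are hypothetical, and the model assumes that the VMM occupies privilege level 0 whenever the guest OS does not.

```python
def sysenter_lands_in(guest_os_level: int) -> str:
    """SYSENTER always transfers to privilege level 0.

    Whoever occupies level 0 receives control: the guest OS when it is
    not deprivileged, the VMM otherwise.
    """
    return "guest OS" if guest_os_level == 0 else "VMM"


def sysexit_faults(executing_level: int) -> bool:
    """SYSEXIT faults if executed outside privilege level 0."""
    return executing_level != 0


def vmm_interventions_per_system_call(guest_os_level: int) -> int:
    """Count VMM emulations for one SYSENTER/SYSEXIT round trip."""
    count = 0
    if sysenter_lands_in(guest_os_level) == "VMM":
        count += 1   # VMM emulates the guest application's SYSENTER
    if sysexit_faults(guest_os_level):
        count += 1   # VMM emulates the guest OS's SYSEXIT
    return count
```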
Interrupt virtualization
Providing support for external interrupts, especially regarding interrupt masking, presents some specific
challenges to VMM design. Both the IA-32 and Itanium architectures provide mechanisms for masking external
interrupts, thus preventing their delivery when the OS is not ready for them. IA-32 uses the interrupt flag
(IF) in the EFLAGS register to control interrupt masking; the Itanium architecture uses the i bit in the
processor status register (PSR) to provide this function. In both cases, a value of 0 indicates that
interrupts are masked.
A VMM will likely manage external interrupts and deny guest software the ability to control interrupt
masking. Existing protection mechanisms allow such denial of control by ensuring that guest attempts to
control interrupt masking fault in the context of ring deprivileging. Such faulting can cause problems
because some OSs frequently mask and unmask interrupts. Intercepting every guest attempt to do so could
significantly affect system performance.
Even if it were possible to prevent guest modifications of interrupt masking without intercepting each
attempt, challenges would remain when a VMM has a "virtual interrupt" to deliver to a guest. A virtual
interrupt should be delivered only when the guest has unmasked interrupts. To deliver virtual interrupts in a
timely way, a VMM should intercept some but not all attempts by a guest to modify interrupt masking. Doing so
could significantly complicate the design of a VMM.
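The bookkeeping involved can be sketched as follows for the IA-32 case. This is a simplified model under stated assumptions: the class and method names are hypothetical, the guest's interrupt flag is tracked purely in software, and delivery is assumed to mask interrupts again, as an IA-32 interrupt gate does.

```python
from collections import deque

EFLAGS_IF = 1 << 9  # the interrupt flag is bit 9 of EFLAGS; 0 means masked


class VirtualInterruptState:
    """VMM-side record of one guest's interrupt masking and pending vectors."""

    def __init__(self) -> None:
        self.virtual_if = False   # guest's view of IF; False means masked
        self.pending = deque()    # virtual interrupts awaiting delivery
        self.delivered = []       # vectors handed to the guest, in order

    def post(self, vector: int) -> None:
        """The VMM has a virtual interrupt for the guest."""
        self.pending.append(vector)
        self._deliver_if_unmasked()

    def guest_writes_eflags(self, value: int) -> None:
        """The guest changes its interrupt masking."""
        self.virtual_if = bool(value & EFLAGS_IF)
        self._deliver_if_unmasked()

    def must_intercept_unmask(self) -> bool:
        """Unmasking needs VMM attention only while something is pending."""
        return bool(self.pending)

    def _deliver_if_unmasked(self) -> None:
        if self.virtual_if and self.pending:
            self.delivered.append(self.pending.popleft())
            # Assumption: delivery masks interrupts again (interrupt gate).
            self.virtual_if = False
```

The method must_intercept_unmask reflects the point made above: the VMM needs to see a guest's unmasking only when a virtual interrupt is waiting, which is what makes selective interception attractive and also what complicates the design.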
Access to hidden state
Some components of IA-32 and Itanium processor state are not represented in any software-accessible register.
Examples for IA-32 include the hidden descriptor caches for the segment registers. A segment-register load
copies the referenced descriptor (from the GDT or LDT) into this cache, which is not modified if software
later writes to the descriptor tables. IA-32 does not provide a mechanism for saving and restoring hidden
components of a guest context when changing VMs or for preserving them while the VMM is running.
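The descriptor-cache behavior can be illustrated with a toy model. The class, the dictionary-based descriptor table, and the field values are all assumptions made for the example; real descriptors are packed 8-byte entries.

```python
class SegmentRegister:
    """Toy model of a segment register with its hidden descriptor cache."""

    def __init__(self) -> None:
        self.selector = 0               # software-visible part
        self.cached_descriptor = None   # hidden part, not directly readable

    def load(self, selector: int, table: dict) -> None:
        """A segment-register load copies the referenced descriptor."""
        self.selector = selector
        self.cached_descriptor = dict(table[selector >> 3])  # copy, not alias


def reload_reproduces_cache(seg: SegmentRegister, table: dict) -> bool:
    """Would reloading the visible selector recreate the hidden state?"""
    return table[seg.selector >> 3] == seg.cached_descriptor


# Hypothetical descriptor table with a single entry at index 1.
gdt = {1: {"base": 0x0, "limit": 0xFFFFF}}
ds = SegmentRegister()
ds.load(0x08, gdt)
gdt[1]["limit"] = 0x0FFFF   # software later rewrites the table entry
```

After the table write, the cache still holds the old limit, and reloading the selector would not reproduce it. A VMM that can save only the visible selector therefore cannot restore the guest's hidden state faithfully.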
In the Itanium architecture, the Register Stack Engine (RSE) contains a field called the current frame
load enable (CFLE). There is no direct way to write this value. In some cases, the VMM takes an
external interrupt and wants to return to the guest OS with this value equal to zero. However, the return
from interrupt (rfi) instruction forces this value to one.
Ring compression
Ring deprivileging uses privilege-based mechanisms to protect the VMM from guest software. IA-32 includes two
such mechanisms: segment limits and paging. Because segment limits do not apply in 64-bit mode, paging must
be used in this mode. Because IA-32 paging does not distinguish privilege levels 0 through 2, the guest OS must run
at privilege level 3 (the 0/3/3 model). Thus, the guest OS runs at the same privilege level as guest
applications and is not protected from them. This problem is called ring compression.
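The argument can be checked with a small model of the paging protection rule. The function names are hypothetical, and the model assumes that the VMM marks its own pages supervisor-only.

```python
def paging_class(privilege_level: int) -> str:
    """IA-32 paging treats levels 0 through 2 alike; only level 3 differs."""
    return "user" if privilege_level == 3 else "supervisor"


def can_access(privilege_level: int, supervisor_only_page: bool) -> bool:
    """Apply the paging protection check for one access."""
    if not supervisor_only_page:
        return True
    return paging_class(privilege_level) == "supervisor"


def paging_protects_vmm(guest_os_level: int) -> bool:
    """Assumes VMM pages are supervisor-only; is the guest OS kept out?"""
    return not can_access(guest_os_level, supervisor_only_page=True)


def guest_os_protected_from_apps(guest_os_level: int, app_level: int = 3) -> bool:
    """Paging can separate guest OS and applications only across classes."""
    return paging_class(guest_os_level) != paging_class(app_level)
```

Under this model, a guest OS at level 1 could reach the VMM's pages, so paging alone rules out the 0/1/3 model. A guest OS at level 3 is kept out of the VMM's pages but shares a paging class with its own applications.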
Frequent access to privileged resources
A VMM may prevent guest access to privileged resources by forcing attempts at such accesses to fault. Even
when this ensures correct behavior, performance may be compromised if the frequency of such faults is
excessive.
In the IA-32 and Itanium architectures, an example involves the task-priority register (TPR). For the IA-32
architecture, this register is located in the advanced programmable interrupt controller (APIC), and for the
Itanium architecture, it is one of the control registers. Because it controls interrupt prioritization, a VMM
must not allow a guest OS access to the TPR. However, some OSs perform such accesses with very high
frequency. These accesses require VMM intervention only if they cause the TPR to drop below a value
determined by the VMM.
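The gap between the accesses that fault and the accesses that actually matter can be made concrete with a short sketch. The threshold and the sequence of TPR writes are invented for illustration, and the function names are ours.

```python
from typing import Iterable


def tpr_write_needs_vmm(new_tpr: int, vmm_threshold: int) -> bool:
    """Only a write that drops the TPR below the threshold needs the VMM."""
    return new_tpr < vmm_threshold


def count_needed_interventions(writes: Iterable[int], vmm_threshold: int) -> int:
    """Count the writes for which VMM intervention is really required."""
    return sum(1 for value in writes if tpr_write_needs_vmm(value, vmm_threshold))


# Hypothetical sequence of guest TPR writes and a VMM-chosen threshold.
guest_writes = [8, 12, 3, 15, 2]
threshold = 4
faults_if_all_trapped = len(guest_writes)
interventions_needed = count_needed_interventions(guest_writes, threshold)
```

With these made-up values, trapping every access costs five faults although only two of the writes require the VMM to act.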
The Itanium architecture supports efficient interruption handlers by providing them with information about
the interruption and the interrupted context. These data are recorded, not in memory, but in a set of
interruption-control registers. The processor protects system integrity by generating faults in response to
accesses to those registers outside privilege level 0. Typically, every interruption handler reads these
registers. If each such access generates a fault to the VMM, the performance of these handlers will be
severely compromised.
|