|
When virtualizing an I/O device, it is necessary for the underlying virtualization software to service several types of
operations for that device. Interactions between software and physical devices include the following:
- Device discovery: a mechanism for software to discover, query, and configure devices in the platform.
- Device control: a mechanism for software to communicate with the device and initiate I/O operations.
- Data transfers: a mechanism for the device to transfer data to and from system memory. Most devices support DMA in
  order to transfer data.
- I/O interrupts: a mechanism for hardware to be able to notify the software of events and state changes.
Each of these interactions is discussed below for each of the common virtualization techniques, covering implementation,
challenges, advantages, and disadvantages. The VMM could be a single monolithic software stack or a combination of a
hypervisor and specialized guests (as shown in Figure 1). The type of VMM architecture used is independent of the
concepts discussed in this section, but will become relevant later in our discussion.
Emulation
I/O mechanisms on native (non-virtualized) platforms are usually performed on some type of hardware device. The software
stack, commonly a driver in an OS, will interface with the hardware through some type of memory-mapped I/O (MMIO) or
port I/O mechanism, whereby the processor issues instructions to read and write specific memory (or port) address
ranges. The values read and written correspond to direct functions in hardware.
Emulation refers to the implementation of real hardware completely in software. Its greatest advantage is that it does
not require any changes to existing guest software. The software runs as it did in the native case, interacting with the
VMM emulator just as it would with real hardware. The software is unaware that it is really talking to a
virtualized device. In order for emulation to work, several mechanisms are required.
The VMM must expose a device in such a manner that it can be discovered by the guest software. An example is to present a
device in a PCI configuration space so that the guest software can "see" the device and discover the memory addresses
that it can use to interact with the device.
The VMM must also have some method for capturing reads and writes to the device's address range, as well as capturing
accesses to the device-discovery space. This enables the VMM to emulate the real hardware with which the guest software
believes it is interfacing.
The device (usually called a device model) is implemented by the VMM completely in software (see Figure 2). It may be
accessing a real piece of hardware in the platform in some manner to service some I/O, but that hardware is independent
of the device model. For example, a guest might see an Integrated Drive Electronics (IDE) hard disk model exposed by the
VMM, while the real platform actually contains a Serial ATA (SATA) drive.

Figure 2: Device emulation model
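The capture-and-emulate mechanism described above can be pictured with a minimal sketch. Everything in it is an
assumption made for illustration: the ToyVmm and ToyDiskModel classes, the three-register layout, and the base address
are hypothetical and do not correspond to any real device or VMM. The sketch shows only the essential idea, namely that
a trapped guest access is routed to the device model that owns the address, and that the model answers from a backing
store that is independent of the interface the guest sees.

```python
"""Sketch of trap-and-emulate for a memory-mapped device (all names hypothetical)."""


class ToyDiskModel:
    """A made-up device model with three registers."""

    REG_SECTOR = 0x0   # guest writes the sector number here
    REG_COMMAND = 0x4  # guest writes 1 here to start a read
    REG_DATA = 0x8     # guest reads the result here

    def __init__(self, backing_store):
        # The backing store stands in for whatever the VMM really uses to
        # service the I/O; it is independent of the emulated interface.
        self.backing_store = backing_store
        self.sector = 0
        self.data = 0

    def mmio_write(self, offset, value):
        if offset == self.REG_SECTOR:
            self.sector = value
        elif offset == self.REG_COMMAND and value == 1:
            self.data = self.backing_store.get(self.sector, 0)

    def mmio_read(self, offset):
        if offset == self.REG_DATA:
            return self.data
        if offset == self.REG_SECTOR:
            return self.sector
        return 0


class ToyVmm:
    """Routes trapped guest accesses to the device model that owns the address."""

    def __init__(self):
        self.regions = []  # list of (base, size, model)

    def map_device(self, base, size, model):
        self.regions.append((base, size, model))

    def _find(self, addr):
        for base, size, model in self.regions:
            if base <= addr < base + size:
                return model, addr - base
        raise LookupError(f"no device model at {addr:#x}")

    def on_guest_write(self, addr, value):
        model, offset = self._find(addr)
        model.mmio_write(offset, value)

    def on_guest_read(self, addr):
        model, offset = self._find(addr)
        return model.mmio_read(offset)


vmm = ToyVmm()
vmm.map_device(0xFEB00000, 0x10, ToyDiskModel({7: 0xCAFE}))
vmm.on_guest_write(0xFEB00000, 7)  # guest driver selects sector 7
vmm.on_guest_write(0xFEB00004, 1)  # guest driver issues the read command
result = vmm.on_guest_read(0xFEB00008)
```

Note that even this toy read costs three trapped accesses, which is the source of the overhead discussed later in this
section.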
The VMM must also have a mechanism for injecting interrupts into the guest at appropriate times on behalf of the
emulated device. This is usually accomplished by emulating a Programmable Interrupt Controller (PIC). Once again, when
the guest software exercises the PIC, these accesses must be trapped and the PIC device modeled appropriately by the
VMM. While the PIC can be thought of as just another I/O device, it has to be there for any other interrupt-driven I/O
devices to be emulated properly.
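As a rough illustration of interrupt injection through an emulated interrupt controller, the following sketch models
only pending, masked, and in-service state. It is a deliberate simplification and an assumption of this example, not a
model of any real PIC: the ToyPic class and its method names are hypothetical, and features such as nested
higher-priority interrupts, edge versus level triggering, and cascading are left out.

```python
"""Highly simplified interrupt controller model (all names hypothetical)."""


class ToyPic:
    """Eight interrupt lines; the lowest-numbered deliverable line wins."""

    def __init__(self):
        self.pending = 0        # bit n set: a device model raised line n
        self.masked = 0         # bit n set: the guest masked line n
        self.in_service = None  # line the guest is currently handling

    def raise_irq(self, line):
        """Called by a device model when it has an event for the guest."""
        self.pending |= 1 << line

    def guest_write_mask(self, value):
        """Reached when the VMM traps the guest's write to the mask register."""
        self.masked = value & 0xFF

    def guest_eoi(self):
        """Reached when the VMM traps the guest's end-of-interrupt command."""
        self.in_service = None

    def next_interrupt(self):
        """Called by the VMM before entering the guest; returns a line or None."""
        if self.in_service is not None:
            return None  # simplification: no nesting of interrupts
        deliverable = self.pending & ~self.masked & 0xFF
        if deliverable == 0:
            return None
        # Isolate the lowest set bit and convert it to a line number.
        line = (deliverable & -deliverable).bit_length() - 1
        self.pending &= ~(1 << line)
        self.in_service = line
        return line


pic = ToyPic()
pic.raise_irq(5)
pic.raise_irq(3)
pic.guest_write_mask(0b00001000)  # guest masks line 3
first = pic.next_interrupt()      # line 5 is injected
blocked = pic.next_interrupt()    # nothing until the guest signals EOI
pic.guest_eoi()
pic.guest_write_mask(0)           # guest unmasks line 3
second = pic.next_interrupt()     # line 3 is injected
```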
Emulation facilitates migration of VMs from one platform to another. Since the devices are purely emulated and have no
ties to physical devices in the platform, it is easy to move a VM to another platform where the VMM can support the
exact same emulated devices. If the guest VM did have some tie to any platform physical devices, those same physical
devices would need to be present on any platform to which the VM was migrated.
Emulation also facilitates the sharing of platform physical devices of the same type, because there are instances of an
emulation model exposed to potentially many guests. The VMM can use some type of sharing mechanism to give all guests'
emulation models access to the services of a single physical device. For example, the traffic from many guests with
emulated network adapters could be bridged onto the platform's physical network adapter.
Since emulation presents to the guest software the exact interface of some existing physical hardware device, it can
support a number of different guest OSs in an OS-independent manner. For example, if a particular storage device is
emulated completely, then it will work with any software written for that device, independent of the guest OS, whether
it be Windows*, Linux*, or some other IA-based OS. Since most modern OSs ship with drivers for many well-known devices,
a particular device make and model can be selected for emulation such that it will be supported by these existing legacy
environments.
While emulation's greatest advantage is that there are no requirements to modify guest device drivers, its largest
drawback is low performance. Each interaction of the guest device driver with the emulated device hardware requires a
transition to the VMM, where the device model performs the necessary emulation, and then a transition back to the guest
with the appropriate results. Depending upon the type of I/O device that is being emulated, many of these transactions
may be required to actually retrieve data from the device. These activities add considerable overhead compared to normal
software-hardware interactions in a non-virtualized system. Most of this new overhead is compute-bound in nature and
increases CPU utilization. The timing involved in each interaction can also accumulate to increase overall latency.
Another disadvantage of emulation is that the device model needs to emulate the hardware device very accurately,
sometimes to the revision of the hardware, and must cover all corner cases. This can result in the need for "bug
emulation" and problems arising with new revisions of hardware.
Paravirtualization
Another technique for virtualizing I/O is to modify the software within the guest, an approach that is commonly referred
to as paravirtualization [4, 8]. The advantage of I/O paravirtualization is better performance. A disadvantage is that
it requires modification of the guest software, in particular device drivers, which limits its applicability to legacy
OS and device-driver binaries.
With paravirtualization (see Figure 3) the altered guest software interacts directly with the VMM, usually at a higher
abstraction level than the normal hardware/software interface. The VMM exposes an I/O type-specific API, for example, to
send and receive network packets in the case of a network adapter. The altered software in the guest then uses this VMM
API instead of interacting directly with a hardware device interface.
Paravirtualization reduces the number of interactions between the guest OS and VMM, resulting in better performance
(higher throughput, lower latency, reduced CPU utilization), compared to device emulation.
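The difference in the number of guest-to-VMM interactions can be made concrete with a toy count. The figures used here
(five register accesses per packet for the emulated adapter, sixteen packets per paravirtualized call) are assumptions
chosen purely for illustration, and the function names are hypothetical; real devices and VMM interfaces will differ.

```python
"""Toy count of guest-to-VMM transitions (all names and numbers hypothetical)."""


class TransitionCounter:
    def __init__(self):
        self.exits = 0

    def trap(self):
        self.exits += 1


def send_emulated(counter, packets, register_accesses_per_packet):
    """Every register access of the emulated adapter traps to the VMM."""
    for _ in range(packets):
        for _ in range(register_accesses_per_packet):
            counter.trap()


def send_paravirtualized(counter, packets, batch_size):
    """The guest hands the VMM a whole batch of packets in one call."""
    for _ in range(0, packets, batch_size):
        counter.trap()


emulated = TransitionCounter()
send_emulated(emulated, packets=64, register_accesses_per_packet=5)

paravirt = TransitionCounter()
send_paravirtualized(paravirt, packets=64, batch_size=16)

print(emulated.exits, paravirt.exits)
```

Under these assumed numbers the emulated path takes 320 transitions and the paravirtualized path takes 4 for the same
64 packets.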
Instead of using an emulated interrupt mechanism, paravirtualization uses an eventing or callback mechanism. This again
has the potential to deliver better performance, because interactions with a PIC hardware interface are eliminated, and
because most OSs handle interrupts in a staged manner that adds overhead and latency. First, interrupts are fielded by a
small Interrupt Service Routine (ISR). An ISR usually acknowledges the interrupt and schedules a corresponding worker
task. The worker task is then run in a different context to handle the bulk of the work associated with the interrupt.
With an event or callback being initiated directly in the guest software by the VMM, the work can be handled directly in
the same context. With some implementations, when the VMM wishes to introduce an interrupt into the guest, it must force
the running guest to exit to the VMM, where any pending interrupts can be picked up when the guest is reentered. To
force a running guest to exit, a mechanism such as an inter-processor interrupt (IPI) can be used, but this again adds
overhead compared to a direct callback or event. Again, the largest drawback of this approach is that the interrupt
handling mechanisms of the guest OS kernel must also be altered.

Figure 3: Device paravirtualization
Since paravirtualization involves changing guest software, usually the changed components are specific to the guest
environment. For instance, a paravirtualized storage driver for Windows XP* will not work in a Linux environment.
Therefore, a separate paravirtualized component must be developed and supported for each targeted guest environment.
These changes require a priori knowledge of which guest environments will be supported by a particular VMM.
As with device emulation, paravirtualization is supportive of VM migration, provided that the VM is migrated to a
platform that supports the same VMM APIs required by the guest software stack.
Sharing of any platform physical devices of the same type is supported in the same manner as emulation. For example,
guests using a paravirtualized storage driver to read and write data could be backed by stores on the same physical
storage device managed by the VMM.
Paravirtualization is increasingly deployed to satisfy the performance requirements of I/O-intensive applications.
Paravirtualization of I/O classes that are performance sensitive, such as networking, storage, and high-performance
graphics, appears to be the method of choice in modern VMM architecture. As described, paravirtualization of I/O
decreases the number of transitions between the guest VM and the VMM and eliminates most of the processing
associated with device emulation.
Paravirtualization leads to a higher level of abstraction for I/O interfaces within the guest OS. I/O-buffer allocation
and management policies that are aware of the fact that they are virtualized can be used for more efficient use of the
VT-d protection and translation facilities than would be possible with an unmodified driver that relies on full device
emulation.
At least three of the major VMM vendors have adopted the capability to paravirtualize I/O in order to accomplish greater
scaling and performance. Xen* and VMware already have the ability to run paravirtualized I/O drivers and Microsoft's
plans include I/O paravirtualization in its next-generation VMM.
Direct Assignment
There are cases where it is desirable for a physical I/O device in the platform to be directly owned by a particular
guest VM. Like emulation, direct assignment allows the owning guest VM to interface directly to a standard device
hardware interface. Therefore, direct device assignment provides a native experience for the guest VM, because it can
reuse existing drivers or other software to talk directly to the device.
Direct assignment improves performance over emulation because it allows the guest VM device driver to talk to the device
in its native hardware command format, eliminating the overhead of translating from the device command format of the
virtual emulated device. More importantly, direct assignment increases VMM reliability and decreases VMM complexity
since complex device drivers can be moved from the VMM to the guest.
Direct assignment, however, is not appropriate for all usages. First, a VMM can only allocate as many devices as are
physically present in the platform. Second, direct assignment complicates VM migration in a number of ways. In order to
migrate a VM between platforms, a similar device type, make, and model must be present and available on each platform.
The VMM must also develop methods to extract any physical device state from the source platform, and to restore that
state at the destination platform.
Moreover, in the absence of hardware support for direct assignment, direct assignment fails to reach its full potential
in improving performance and enhancing reliability. First, platform interrupts may still need to be fielded by the VMM
since it owns the rest of the physical platform. These interrupts must be routed to the appropriate guest, in this case
the one that owns the physical device. Therefore, there is still some overhead in this relaying of interrupts. Second,
existing platforms do not provide a mechanism for a device to directly perform data transfers to and from the system
memory that belongs to the guest VM in an efficient and secure manner. A guest VM is typically operating in a subset of
the real physical address space. What the guest VM believes is its physical memory really is not; it is a subset of the
system memory virtualized by the VMM for the guest. This addressing mismatch causes a problem for DMA-capable devices.
Such devices place data directly into system memory without involving the CPU. When the guest device driver instructs
the device to perform a transfer, it is using guest physical addresses, while the hardware is accessing system memory
using host physical addresses.
In order to deal with the address space mismatch, VMMs that support direct assignment may employ a pass-through driver
that intercepts all communication between the guest VM device driver and the hardware device. The pass-through driver
performs the translation between the guest physical and real physical address spaces of all command arguments that refer
to physical addresses. Pass-through drivers are device-specific since they must decode the command format for a specific
device to perform the necessary translations. Such drivers perform a simpler task than traditional device drivers;
therefore, performance is improved over emulation. However, VMM complexity remains high, thereby impacting VMM
reliability. Still, the performance benefits have proven sufficient to employ this method in VMMs targeted to the server
space, where it is acceptable to support direct assignment for only a relatively small number of common devices.
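The address rewriting performed by a pass-through driver can be sketched as follows. This is an illustration under
stated assumptions, not a real driver: the command is a hypothetical dictionary in which only the buffer_addr field
holds an address, the GuestMemoryMap class and translate_command function are invented names, and the sketch refuses
buffers that cross a page boundary because contiguous guest pages need not be contiguous in host memory.

```python
"""Sketch of guest-physical to host-physical translation (all names hypothetical)."""

PAGE_SIZE = 4096


class GuestMemoryMap:
    """Maps guest-physical page numbers to host-physical page numbers."""

    def __init__(self, page_map):
        self.page_map = dict(page_map)

    def to_host(self, guest_addr):
        page, offset = divmod(guest_addr, PAGE_SIZE)
        if page not in self.page_map:
            raise ValueError(f"guest address {guest_addr:#x} is not mapped")
        return self.page_map[page] * PAGE_SIZE + offset


def translate_command(command, memory_map):
    """Return a copy of the command with its buffer address rewritten."""
    # A real pass-through driver must know which fields of the device's
    # command format carry addresses; in this made-up format only
    # 'buffer_addr' does, which is why such drivers are device-specific.
    offset = command["buffer_addr"] % PAGE_SIZE
    if offset + command["length"] > PAGE_SIZE:
        raise ValueError("this sketch handles single-page buffers only")
    translated = dict(command)
    translated["buffer_addr"] = memory_map.to_host(command["buffer_addr"])
    return translated


memory_map = GuestMemoryMap({0x10: 0x80})  # guest page 0x10 lives in host page 0x80
guest_command = {"opcode": "read", "buffer_addr": 0x10020, "length": 512}
host_command = translate_command(guest_command, memory_map)
```

Here the guest's buffer at 0x10020 is rewritten to host address 0x80020 before the command reaches the device; the
guest's own copy of the command is left untouched.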
VMM Software Architecture Implications
Different I/O virtualization methods are not equally applicable to all VMM software architecture options.
Emulation is the most general I/O virtualization method, able to expose standard I/O devices to an unmodified guest OS.
Accordingly, it is widely employed in existing OS-hosted, stand-alone hypervisor or hybrid VMM implementations.
As already mentioned, paravirtualization is increasingly being deployed in many VMMs to improve performance for common
guests. It is readily applicable to stand-alone hypervisor VMMs. It can also be used in the interaction between the
guest OS and the ULM in an OS-hosted VMM or can be used in the guest OS and the service VM in a hybrid VMM.
Direct assignment is used in cases where the guest OS cannot be modified, either because it is difficult to do so or
because the paravirtualized guest device drivers are not qualified for a specific application. However, it is difficult
to introduce direct assignment in an OS-hosted VMM since, in general, such VMMs do not own real platform devices and do
not maintain
device drivers for such devices. On the other hand, direct assignment naturally reduces complexity in stand-alone
hypervisor and hybrid VMMs since device drivers can be moved to the guest OS or service OSs, respectively. This reduced
complexity is not possible with either emulation or paravirtualization.
As our discussion suggests, it is quite likely that a VMM can employ many different techniques for I/O virtualization
concurrently. For instance, in the context of hybrid VMM, direct assignment might be used to assign a platform physical
device to a particular guest VM, whose responsibility it is to share that device with many guests. Depending upon the
needs and requirements of the guest, it may offer both emulated device models, as well as paravirtualized solutions to
the different guests. A common configuration is to provide paravirtualized solutions for the most common guest
environments, while an emulation solution is offered to support all other legacy environments.
IOVM Architecture
A major emerging trend among developers of virtualization software, in particular for I/O processing and sharing, is
VMM system decomposition.
The trend for the software architecture of VMMs is to move from a monolithic hypervisor model towards a software
architecture that decomposes the VMM into a very thin privileged "micro-hypervisor" that resides just above the physical
hardware, and one or more special-purpose VMs that are de-privileged relative to the hypervisor, and are responsible for
services and policy. With regard to I/O virtualization, these deprivileged components of the VMM can be responsible for
I/O processing and I/O resource sharing. We call this general architecture the "IOVM" model (see Figure 4). The IOVM
model is a generalization of the hybrid VMM architecture in that I/O devices can be allocated to different service VMs
specialized for the specific I/O function (e.g., network VM, storage VM, etc.).
Two major benefits of the IOVM model are the ability to use unmodified device drivers within the IOVM and the isolation
of the physical device and its driver(s) from the other guest OSs, applications, and hypervisor. The use of unmodified
drivers is possible because these drivers can run in a separate OS environment, in contrast to a monolithic hypervisor
where new drivers are often written for the VMM environment. The isolation of the device and its driver protects the
guest VMs from driver crashes, that is, the IOVM may crash due to a driver failure without severely affecting the guest
OSs. A disadvantage of the IOVM model is that there is additional overhead incurred, due to additional communication and
data movement between the guest OS and the IOVM. This performance penalty can be offset by paravirtualizing the
interface of the IOVM, thus minimizing the number of interactions. The Xen VMM has implemented this architecture as
"Isolated Driver Domains" [6], and Microsoft is in the process of developing a version of this architecture in their
next generation of VMMs [7].
Direct assignment of I/O devices to IOVMs directly facilitates this usage model and is becoming increasingly important
as VMMs are transitioning to this architecture. As we have seen, however, software by itself is not capable of fully
protecting the system from errant DMA traffic between the I/O device and system memory while at the same time
eliminating all device-specific functionality in the VMM. Hardware support on the platform closes this gap by allowing
the device to be safely assigned to an IOVM, thus allowing full protection from errant DMA transfers.

Figure 4: IOVM software architecture
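The protection from errant DMA that hardware support provides can be pictured with a small model. This is only a
conceptual sketch under assumptions of this example: the DmaRemapper class, the device identifier strings, and the
per-page table layout are hypothetical and are not the actual table format of any hardware. The point is that
translation and permission checks happen per device, outside the VMM's device-specific code, so a device assigned to an
IOVM can reach only the memory mapped for it.

```python
"""Conceptual model of per-device DMA remapping (all names hypothetical)."""

PAGE_SIZE = 4096


class DmaRemapper:
    """Holds one translation table per assigned device."""

    def __init__(self):
        # device id -> {guest page number: (host page number, writable)}
        self.tables = {}

    def assign(self, device_id, mappings):
        """Set up the pages a device may reach on behalf of its owning VM."""
        self.tables[device_id] = dict(mappings)

    def translate(self, device_id, dma_addr, write):
        """Translate a DMA address, or refuse the access."""
        page, offset = divmod(dma_addr, PAGE_SIZE)
        entry = self.tables.get(device_id, {}).get(page)
        if entry is None:
            raise PermissionError(f"{device_id}: no mapping for {dma_addr:#x}")
        host_page, writable = entry
        if write and not writable:
            raise PermissionError(f"{device_id}: write to read-only page")
        return host_page * PAGE_SIZE + offset


remapper = DmaRemapper()
remapper.assign("nic0", {0x10: (0x80, True), 0x11: (0x81, False)})
host_addr = remapper.translate("nic0", 0x10020, write=True)
```

In this model a write by nic0 to page 0x11, or any access to a page outside its table, is rejected rather than
reaching system memory.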
|