To realize the isolation, security, reliability, and performance benefits of direct assignment, we need efficient
hardware mechanisms to constrain the operation of I/O devices. The primary I/O device accesses that require this
isolation are device transfers (DMAs) and interrupts. CPU virtualization mechanisms are sufficient to efficiently
perform device discovery and schedule device operations.
Accordingly, VT-d [12] provides the platform hardware support for DMA and interrupt virtualization.
DMA Remapping
DMA remapping facilities have been implemented in a variety of contexts in the past to facilitate different usages. In
workstations and server platforms, traditional I/O memory management units (IOMMUs) have been implemented in PCI root
bridges to efficiently support scatter/gather operations or I/O devices with limited DMA addressability [17]. Other
well-known examples of DMA remapping facilities include the AGP Graphics Aperture Remapping Table (GART) [18], the
Translation and Protection Table (TPT) defined in the Virtual Interface Architecture [14], and the similar capabilities it
subsequently influenced in the InfiniBand Architecture [16] and the Remote DMA (RDMA) over TCP/IP specifications [19]. DMA
remapping facilities have also been explored in the context of NICs designed for low latency cluster interconnects [15].
Traditional IOMMUs typically support an aperture-based architecture. All DMA requests that target a programmed aperture
address range in the system physical address space are translated irrespective of the source of the request. While this
approach is useful for handling legacy device limitations (such as limited DMA addressability or scatter/gather
capabilities), it is not adequate for I/O virtualization usages that require full DMA isolation.
The VT-d architecture is a generalized IOMMU architecture that enables system software to create multiple DMA protection
domains. A protection domain is abstractly defined as an isolated environment to which a subset of the host physical
memory is allocated. Depending on the software usage model, a DMA protection domain may represent memory allocated to a
VM, or the DMA memory allocated by a guest-OS driver running in a VM or as part of the VMM itself. The VT-d architecture
enables system software to assign one or more I/O devices to a protection domain. DMA isolation is achieved by
restricting access to a protection domain's physical memory from I/O devices not assigned to it, through address-
translation tables.
The I/O devices assigned to a protection domain can be provided a view of memory that may be different than the host
view of physical memory. VT-d hardware treats the address specified in a DMA request as a DMA virtual address (DVA).
Depending on the software usage model, a DVA may be the Guest Physical Address (GPA) of the VM to which the I/O device
is assigned, or some software-abstracted virtual I/O address (similar to CPU linear addresses). VT-d hardware transforms
the address in a DMA request issued by an I/O device to its corresponding Host Physical Address (HPA).
Figure 5 illustrates DMA address translation in a multi-domain usage. I/O devices 1 and 2 are assigned to protection
domains 1 and 2, respectively, each with its own view of the DMA address space.

Figure 5: DMA remapping
Figure 6 illustrates a PC platform configuration with VT-d hardware implemented in the north-bridge component.

Figure 6: Platform configuration with VT-d
Mapping Devices to Protection Domains
To support multiple protection domains, the DMA remapping hardware must identify the device originating each DMA
request. The requester identifier of a device is composed of its PCI Bus/Device/Function number assigned by PCI
configuration software and uniquely identifies the hardware function that initiated the request. Figure 7 illustrates
the requester-id as defined by the PCI specifications [20].

Figure 7: PCI requester identifier format
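The requester-id layout can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard 16-bit layout of 8 bus bits, 5 device bits, and 3 function bits; the names `RequesterId`, `decode_requester_id`, and `encode_requester_id` are hypothetical and not part of any specification.

```python
from typing import NamedTuple


class RequesterId(NamedTuple):
    bus: int       # 8 bits, assigned by PCI configuration software
    device: int    # 5 bits
    function: int  # 3 bits


def decode_requester_id(rid: int) -> RequesterId:
    """Split a 16-bit requester-id into bus, device, and function numbers."""
    if not 0 <= rid <= 0xFFFF:
        raise ValueError("requester-id must fit in 16 bits")
    return RequesterId(bus=(rid >> 8) & 0xFF,
                       device=(rid >> 3) & 0x1F,
                       function=rid & 0x07)


def encode_requester_id(bus: int, device: int, function: int) -> int:
    """Pack bus, device, and function numbers into a 16-bit requester-id."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x07):
        raise ValueError("field out of range")
    return (bus << 8) | (device << 3) | function
```

For example, `decode_requester_id(0x03E5)` yields bus 3, device 28, function 5.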
VT-d architecture defines the following data structures for mapping I/O devices to protection domains (see Figure 8):
-
Root-Entry Table: Each entry in the root-entry table functions as the top-level structure to map devices for a
specific PCI bus. The bus-number portion of the requester-id in DMA requests is used to index into the root-entry table.
Each present root entry includes a pointer to a context-entry table.
-
Context-Entry Table: Each entry in the context-entry table maps a specific I/O device on a bus to the protection
domain to which it is assigned. The device and function-number portion of the requester-id is used to index into the
context-entry table. Each present context entry includes a pointer to the address translation structures used to
translate the address in the DMA request.

Figure 8: Device mapping structures
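The two-step lookup through the root-entry and context-entry tables might be modeled as follows. This is only a sketch: the tables are plain dictionaries rather than memory-resident structures, and `ContextEntry`, `domain_id`, `page_table_root`, and `lookup_context` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ContextEntry:
    domain_id: int        # hypothetical label for the protection domain
    page_table_root: int  # stand-in for the pointer to the translation structures


# A context-entry table maps the 8-bit device/function index to a context entry.
ContextTable = Dict[int, ContextEntry]
# The root-entry table maps the 8-bit bus number to a context-entry table.
RootTable = Dict[int, ContextTable]


def lookup_context(root_table: RootTable, requester_id: int) -> Optional[ContextEntry]:
    """Return the context entry for a requester-id, or None if not present."""
    bus = (requester_id >> 8) & 0xFF   # indexes the root-entry table
    devfn = requester_id & 0xFF        # indexes the context-entry table
    context_table = root_table.get(bus)
    if context_table is None:
        return None                    # root entry not present
    return context_table.get(devfn)    # None if the context entry is not present


# Example: bus 0, device 2, function 0 is assigned to domain 1.
root_table: RootTable = {
    0: {(2 << 3) | 0: ContextEntry(domain_id=1, page_table_root=0x1000)},
}
```

A request from a device with no present root or context entry returns `None` here; in this model that stands for a request that cannot be mapped to any protection domain.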
Address Translation
VT-d architecture defines a multi-level page-table structure for DMA address translation (see Figure 9). The multi-level
page tables are similar to IA-32 processor page-tables, enabling software to manage memory at 4 KB or larger page
granularity. Hardware implements the page-walk logic and traverses these structures using the address from the DMA
request. The number of page-table levels that must be traversed is specified through the context-entry referencing the
root of the page table. The page directory and page-table entries specify independent read and write permissions, and
hardware computes the cumulative read and write permissions encountered in a page walk as the effective permissions for
a DMA request. The page-table and page-directory structures are always 4 KB in size, and larger page sizes (2 MB, 1 GB,
etc.) are enabled through super-page support.

Figure 9: Example 3-level page table
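The page walk described above can be approximated in software. The sketch below assumes 9 index bits per level (which follows if each 4 KB table holds 512 eight-byte entries) and represents tables as dictionaries; the `Entry` fields and the `translate` function are illustrative names, not the hardware entry format.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

PAGE_SHIFT = 12  # 4 KB base pages
INDEX_BITS = 9   # assumption: 512 entries per 4 KB table


@dataclass
class Entry:
    addr: int                 # next-level table base, or page base for a leaf
    read: bool
    write: bool
    super_page: bool = False  # leaf above the last level (2 MB, 1 GB, ...)


Table = Dict[int, Entry]   # index -> entry; a missing index means "not present"
Memory = Dict[int, Table]  # table base address -> table


def translate(memory: Memory, root: int, levels: int,
              dva: int) -> Optional[Tuple[int, bool, bool]]:
    """Walk the tables; return (hpa, can_read, can_write) or None if unmapped."""
    if dva >> (PAGE_SHIFT + INDEX_BITS * levels):
        return None  # address is wider than the tables can cover
    table_addr = root
    can_read = can_write = True
    for level in range(levels, 0, -1):
        shift = PAGE_SHIFT + INDEX_BITS * (level - 1)
        index = (dva >> shift) & ((1 << INDEX_BITS) - 1)
        entry = memory.get(table_addr, {}).get(index)
        if entry is None:
            return None
        # Effective permissions are the cumulative permissions along the walk.
        can_read = can_read and entry.read
        can_write = can_write and entry.write
        if level == 1 or entry.super_page:
            offset = dva & ((1 << shift) - 1)
            return entry.addr + offset, can_read, can_write
        table_addr = entry.addr
    return None


# Example 3-level structure: one 4 KB mapping and one 2 MB super-page mapping.
memory: Memory = {
    0x1000: {1: Entry(0x2000, True, True)},
    0x2000: {2: Entry(0x3000, True, False),
             5: Entry(0x80000000, True, True, super_page=True)},
    0x3000: {3: Entry(0x7654000, True, True)},
}
```

With three levels the walk covers a 39-bit address space. In the example, the write permission cleared in the second-level entry makes the 4 KB mapping read-only even though the leaf entry allows writes.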
Interrupt Remapping
For proper device isolation in a virtualized system, the interrupt requests generated by I/O devices must be controlled
by the VMM. In the existing interrupt architecture for Intel® platforms, a device may generate either a legacy interrupt
(which is routed through I/O interrupt controllers) or may directly issue message signaled interrupts (MSIs) [20]. MSIs
are issued as DMA write transactions to a pre-defined architectural address range, and the interrupt attributes (such as
vector, destination processor, delivery mode, etc.) are encoded in the address and data of the write request. Since the
interrupt attributes are encoded in the request issued by devices, the existing interrupt architecture does not offer
interrupt isolation across protection domains.
The VT-d interrupt-remapping architecture addresses this problem by redefining the interrupt-message format. The new
interrupt message continues to be a DMA write request, but the write request itself contains only a "message identifier"
and not the actual interrupt attributes. The write request, like any DMA request, specifies the requester-id of the
hardware function generating the interrupt.
DMA write requests identified as interrupt requests by the hardware are subject to interrupt remapping. The requester-id
of interrupt requests is remapped through the interrupt-remapping table structure. Each entry in the table corresponds
to a unique interrupt message identifier from a device and includes all the necessary interrupt attributes (such as
destination processor, vector, delivery mode, etc.). The architecture supports remapping interrupt messages from all
sources including I/O interrupt controllers (IOAPICs), and all flavors of MSI and MSI-X interrupts defined in the PCI
specifications.
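A rough model of the remapped interrupt flow is shown below. It assumes the message identifier selects the table entry directly and that each entry records which requester-id is permitted to use it; that check, and the names `InterruptEntry` and `remap_interrupt`, are assumptions made for illustration rather than the architected format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class InterruptEntry:
    vector: int
    destination_cpu: int
    delivery_mode: str
    allowed_requester_id: int  # assumption: the entry names its permitted source


def remap_interrupt(table: Dict[int, InterruptEntry], requester_id: int,
                    message_id: int) -> Optional[InterruptEntry]:
    """Return the interrupt attributes for a message, or None if it is blocked."""
    entry = table.get(message_id)
    if entry is None:
        return None  # no entry programmed for this message identifier
    if entry.allowed_requester_id != requester_id:
        return None  # message came from a function that does not own the entry
    return entry


# The VMM, not the device, chooses the attributes stored in the table.
table = {7: InterruptEntry(vector=0x41, destination_cpu=2,
                           delivery_mode="fixed", allowed_requester_id=0x0010)}
```

Because the device supplies only the message identifier, it cannot pick an arbitrary vector or destination processor; the attributes come from the entry that software programmed.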
Software Usages of DMA and Interrupt Remapping
The VT-d architecture enables DMA and interrupt requests from an I/O device to be isolated to its assigned protection
domain. This capability makes possible a number of usages:
-
Remapping for legacy guests: In this usage an I/O device is assigned directly to a VM running a legacy
(virtualization unaware) environment. Since the guest OS has the guest-physical view of memory in this usage, the VMM
programs the DMA remapping structures for the I/O device to support appropriate GPA to HPA mappings. Similarly, the VMM
may program the interrupt-remapping structures to enable the interrupt requests from the I/O device to target the
physical CPUs running the appropriate virtual CPUs of the legacy VM.
-
Remapping for IOMMU-aware guests: An OS may be capable of using DMA and interrupt remapping hardware to improve its
reliability or to handle specific I/O-device limitations. When such an OS is running within a VM, the VMM may
expose virtual (emulated or paravirtualized) remapping hardware to the VM. The OS may create one or more protection
domains each with its own DMA Virtual Address (DVA) space and program the virtual remapping hardware structures to
support DVA to Guest Physical Address (GPA) mappings. The VMM must virtualize the remapping hardware by intercepting
guest accesses to the virtual hardware and shadowing the virtual remapping structures to provide the physical hardware
with structures for DVA to HPA mappings. Similar page table shadowing techniques are commonly used by the VMM for CPU
MMU virtualization.
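The shadowing step for IOMMU-aware guests amounts to composing two mappings. The sketch below works at page-number granularity with flat dictionaries instead of multi-level tables; `build_shadow` and the sample mappings are hypothetical.

```python
from typing import Dict


def build_shadow(dva_to_gpa: Dict[int, int],
                 gpa_to_hpa: Dict[int, int]) -> Dict[int, int]:
    """Compose guest DVA->GPA mappings with the VMM's GPA->HPA mappings.

    All keys and values are page numbers. Guest mappings that point at a GPA
    the VMM has not backed with host memory are left out of the shadow table,
    so DMA to those pages would not be translated.
    """
    shadow: Dict[int, int] = {}
    for dva_page, gpa_page in dva_to_gpa.items():
        hpa_page = gpa_to_hpa.get(gpa_page)
        if hpa_page is not None:
            shadow[dva_page] = hpa_page
    return shadow


# Guest-programmed virtual remapping structures (DVA page -> GPA page).
guest_mappings = {0x10: 0x200, 0x11: 0x201, 0x12: 0x999}
# VMM-owned mappings for this VM (GPA page -> HPA page).
vmm_mappings = {0x200: 0x8000, 0x201: 0x9000}

shadow_table = build_shadow(guest_mappings, vmm_mappings)
```

In a real VMM the shadow structures would be rebuilt or patched whenever an intercepted guest access changes the virtual structures; this sketch rebuilds the whole table for simplicity.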
Hardware Caching and Invalidation Architecture
To improve DMA and interrupt-remapping performance, the VT-d architecture allows hardware implementations to cache
frequently used remapping-structure entries. Specifically, the following architectural caching constructs are defined:
-
Context Cache: Caches frequently used context entries that map devices to protection domains.
-
PDE (Page Directory Entry) Cache: Caches frequently used page-directory entries encountered by hardware during page
walks.
-
IOTLB (I/O Translation Look-aside Buffer): Caches frequently used effective translations (results of the page walk).
-
Interrupt Entry Cache: Caches frequently used interrupt-remapping table entries.
These caching structures are fully managed by the hardware. When updating the remapping structures, the software is
responsible for maintaining the consistency of these caches by invalidating any stale entries in the caches. VT-d
architecture defines the following invalidation options:
-
Synchronous Invalidation: The synchronous invalidation interface uses a set of memory-mapped registers for software
to request invalidations and to poll for invalidation completions.
-
Queued Invalidation: The queued-invalidation interface uses a memory-resident command queue for software to queue
invalidation requests. Software synchronizes invalidation completions with hardware by submitting an invalidation-wait
command to the command queue. Hardware guarantees that all invalidation requests received before an invalidation-wait
command are completed before completing the invalidation-wait command. Hardware signals the invalidation-wait command
completion either through an interrupt or by coherently writing a software-specified memory location. The queued-
invalidation interface enables usages where software can batch invalidation requests.
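The ordering guarantee of the queued interface can be illustrated with a toy model. It assumes a simple FIFO command queue and models the coherent status write as a dictionary update; the class and method names are hypothetical, and the interrupt-based completion option is not modeled.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class RemappingHardwareModel:
    """Toy model of an IOTLB with a queued-invalidation interface."""

    def __init__(self, iotlb: Dict[int, int]) -> None:
        self.iotlb = dict(iotlb)                   # cached DVA page -> HPA page
        self.queue: Deque[Tuple[str, int]] = deque()
        self.status_memory: Dict[int, int] = {}    # software-visible memory

    def submit_invalidate(self, dva_page: int) -> None:
        """Software queues a request to drop one cached translation."""
        self.queue.append(("invalidate", dva_page))

    def submit_wait(self, status_addr: int) -> None:
        """Software queues an invalidation-wait command."""
        self.queue.append(("wait", status_addr))

    def process(self) -> None:
        """Hardware drains the queue in order."""
        while self.queue:
            kind, arg = self.queue.popleft()
            if kind == "invalidate":
                self.iotlb.pop(arg, None)
            else:
                # Every earlier invalidation has completed by this point, so
                # the wait command can signal completion to software.
                self.status_memory[arg] = 1


hw = RemappingHardwareModel({1: 10, 2: 20, 3: 30})
hw.submit_invalidate(1)       # batch several invalidations ...
hw.submit_invalidate(2)
hw.submit_wait(0x100)         # ... then a single wait command
hw.process()
```

Software would poll the status location (here `hw.status_memory[0x100]`) and, once it is set, know that both batched invalidations have taken effect.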
Scaling Address Translation Caches
Caching of the remapping structures enables hardware to minimize the DMA translation overhead that may otherwise be
incurred when accessing the memory-resident translation structures. One of the challenges for DMA-remapping hardware
implementations is to efficiently scale its hardware caching structures. Unlike CPU TLBs that support accesses from a
CPU that is typically running one thread at a time, the DMA-remapping caches handle simultaneous DMA accesses from
multiple devices, and often multiple DMA streams from a device.
This difference makes sizing the IOTLBs in DMA-remapping hardware implementations challenging, especially when the
hardware design is re-used across a wide range of platform configurations. An approach to scaling the IOTLBs is to
enable I/O devices to participate in DMA remapping by requesting translations for their own memory accesses from the
DMA-remapping hardware and caching these translations locally on the I/O device in a Device-IOTLB.
To facilitate scaling of address translation caches, PCI Express* protocol extensions (referred to as Address
Translation Services (ATS)) [22] are being standardized by the PCI Special Interest Group (PCI-SIG) [21]. ATS consists of
a set of PCI transactions that allow the optimization of VT-d address translations. These extensions enable I/O devices
to request translations from the root complex and for the root complex to return responses for each translation request.
I/O devices may cache the returned translations in their local Device-IOTLBs and indicate whether a DMA request is using
an untranslated address or a translated address from the Device-IOTLB. To support usages where software may dynamically modify
the translations, the ATS protocol extensions enable the root complex to request invalidations of translations cached in
the Device-IOTLB of an I/O device, and for the I/O devices to return responses indicating when an invalidation request
is completed.
VT-d architecture supports ATS protocol extensions and enables software to control (through the device-mapping
structures) whether an I/O device can issue these transactions. For DMA requests indicating translated addresses from allowed
devices, VT-d hardware bypasses the DMA-address translation.
I/O devices may implement Device-IOTLBs and support these protocol extensions to minimize performance dependencies on
the DMA-remapping caching resources in the platform. However, to preserve the security, isolation, and reliability
benefits of DMA remapping, device implementations must ensure that only translation responses from the root complex
cause entries to be inserted into the Device-IOTLB.
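The interaction between a device, its Device-IOTLB, and the root complex can be sketched as follows. This is a simplified model working in page numbers; `RootComplexModel`, `DeviceModel`, and their methods are hypothetical names and do not reflect the actual ATS transaction formats.

```python
from typing import Dict, Optional, Tuple


class RootComplexModel:
    """Answers translation requests from the platform's remapping structures."""

    def __init__(self, mappings: Dict[int, int]) -> None:
        self.mappings = dict(mappings)  # DVA page -> HPA page

    def translation_request(self, dva_page: int) -> Optional[int]:
        return self.mappings.get(dva_page)


class DeviceModel:
    """An I/O device that caches translations in a local Device-IOTLB."""

    def __init__(self, root_complex: RootComplexModel) -> None:
        self.root_complex = root_complex
        self.device_iotlb: Dict[int, int] = {}

    def prepare_dma(self, dva_page: int) -> Tuple[int, bool]:
        """Return (page, is_translated) for the DMA request to be issued."""
        if dva_page not in self.device_iotlb:
            hpa_page = self.root_complex.translation_request(dva_page)
            if hpa_page is None:
                # No translation returned: issue an untranslated request.
                return dva_page, False
            # Only a translation response may insert a Device-IOTLB entry.
            self.device_iotlb[dva_page] = hpa_page
        return self.device_iotlb[dva_page], True

    def invalidate(self, dva_page: int) -> bool:
        """Handle an invalidation request; the return models the completion."""
        self.device_iotlb.pop(dva_page, None)
        return True


rc = RootComplexModel({5: 50})
dev = DeviceModel(rc)
first = dev.prepare_dma(5)   # miss, filled from a translation response
rc.mappings[5] = 70          # software modifies the translation ...
dev.invalidate(5)            # ... and the stale cached entry is invalidated
second = dev.prepare_dma(5)  # device re-requests the translation
```

The model keeps a single insertion path into `device_iotlb`, which mirrors the requirement that entries originate only from root-complex translation responses.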
Handling Remapping Errors
Any errors or permission violations detected as part of remapping a DMA request are treated as a remapping fault. Unlike
CPU page faults, which are restart-able at instruction boundaries, DMA-remapping faults are not restart-able due to the
posted nature of PCI transactions. Any DMA write request that generates a fault is blocked by the remapping hardware,
and a faulting DMA read request returns an error to the device in the read response. Hardware logs details of the DMA
requests that cause remapping faults and uses a fault event (interrupt) to inform software about such faults. For devices that
explicitly request translations, an error detected while processing the translation request is not treated as a DMA-
remapping fault, but is merely conveyed to the device in the translation response. This enables such devices to support
device-specific demand page faulting. Demand page faulting is beneficial for devices (such as graphics adapters) with
large DMA footprints, enabling software to demand pin the DMA buffers.
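The fault behavior for ordinary (untranslated) DMA requests can be summarized in a small model. The record fields, the `RemapUnit` class, and the flat page-number mapping are assumptions made for illustration; the actual fault-log format is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FaultRecord:
    requester_id: int
    dva_page: int
    is_write: bool
    reason: str


class RemapUnit:
    """Toy remapping unit that blocks faulting requests and logs them."""

    def __init__(self, mappings: Dict[int, Tuple[int, bool, bool]]) -> None:
        self.mappings = mappings  # DVA page -> (HPA page, can_read, can_write)
        self.fault_log: List[FaultRecord] = []
        self.fault_event_pending = False  # models the fault interrupt

    def handle_dma(self, requester_id: int, dva_page: int,
                   is_write: bool) -> Optional[int]:
        """Return the HPA page on success, or None if the request faulted."""
        mapping = self.mappings.get(dva_page)
        if mapping is None:
            reason = "translation not present"
        elif is_write and not mapping[2]:
            reason = "write not permitted"
        elif not is_write and not mapping[1]:
            reason = "read not permitted"
        else:
            return mapping[0]
        # The request is not restarted: a write is blocked, and a read is
        # completed with an error. Software learns of it through the log.
        self.fault_log.append(FaultRecord(requester_id, dva_page, is_write, reason))
        self.fault_event_pending = True
        return None


unit = RemapUnit({0x10: (0x8000, True, False)})
read_ok = unit.handle_dma(0x0010, 0x10, is_write=False)
write_blocked = unit.handle_dma(0x0010, 0x10, is_write=True)
```

Translation requests from ATS-capable devices are outside this model; per the text above, errors on those are reported in the translation response instead of being logged as remapping faults.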