Tera-scale Computing
Accelerator Exoskeleton
EXO ARCHITECTURE
Architecturally, EXO extends the Multiple Instruction Stream Processor (MISP) architecture [7] in three significant ways: (1) the MISP exoskeleton, (2) Address Translation Remapping (ATR), and (3) Collaborative Exception Handling (CEH). With this architectural support, EXO enables a shared virtual memory, heterogeneous, multi-threaded programming model, despite ISA differences between the IA sequencer and the exo-sequencers.
MISP Exoskeleton
EXO provides a minimal architectural "wrapper," or exoskeleton, to make a non-IA heterogeneous accelerator sequencer conform to the MISP inter-sequencer signaling mechanism. With this exoskeleton, the accelerator sequencer can be exposed as an application-managed sequencer, even though it has a different ISA from the IA sequencers. To distinguish them from application-managed IA sequencers, we call such heterogeneous accelerator sequencers exo-sequencers. The exoskeleton supports interaction with the OS-managed IA sequencer through either initiating or responding to inter-sequencer user-level interrupts. With this enhancement, the code on an OS-managed IA sequencer can use MISP's SIGNAL instruction to dispatch shreds of a non-IA ISA to run on the exo-sequencers. This demands no additional OS support beyond MISP's requirements.

Figure 2: ATR and CEH between heterogeneous sequencers
Microarchitecture Support
Address Translation Remapping
To support shared virtual memory between the OS-managed IA sequencer and the exo-sequencers, EXO provides an ATR mechanism to allow the IA sequencer to handle page faults on behalf of the exo-sequencers.
Maintaining a shared virtual address space between two sequencers requires the same virtual address to be resolved to the same physical memory address on both sequencers. Among sequencers of the same architecture, this is accomplished by having the sequencers utilize the same page table for address translation. In a heterogeneous multi-core with IA sequencers and non-IA exo-sequencers, however, the page table format understood by each sequencer may differ. Directly accessing the IA page table is not an option for the exo-sequencers in such a case.
EXO solves this problem with its ATR mechanism. With ATR, when an exo-sequencer incurs a translation miss, it suspends shred execution and signals the IA sequencer to request proxy execution in order to service that Translation Lookaside Buffer (TLB) miss or page fault. Like MISP, upon receiving the proxy request as a user-level interrupt, the IA shred transfers control to a proxy handler that will touch the virtual address on behalf of the exo-sequencer. Once the page fault is serviced on the IA sequencer, however, unlike MISP, ATR will transcode the IA page table entry to the format of the exo-sequencer's page table entry before inserting the entry into the exo-sequencer's TLB. The exo-sequencer's TLB then points to the same physical page as the IA's TLB and can directly access the needed data. The exo-sequencer then resumes execution. As shown in Figure 2, an address translation remapping mechanism is responsible for remapping the IA page entry to the native format on the accelerator.
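The ATR flow described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the IA entry uses the present and writable bits of a 32-bit IA-32 page table entry, while the exo-sequencer entry layout (`EXO_VALID`, `EXO_WRITABLE`, a 20-bit frame field) and the class and function names are invented for illustration and do not describe any real accelerator.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12  # 4 KiB pages

# IA-32 (non-PAE) page table entry bits used in this sketch.
IA_PRESENT = 1 << 0
IA_WRITABLE = 1 << 1

# Hypothetical exo-sequencer entry layout, invented for this sketch:
# bit 31 = valid, bit 30 = writable, bits 0-19 = physical frame number.
EXO_VALID = 1 << 31
EXO_WRITABLE = 1 << 30
EXO_FRAME_MASK = 0xFFFFF


def transcode_pte(ia_pte: int) -> int:
    """Remap an IA page table entry into the hypothetical exo format."""
    if not ia_pte & IA_PRESENT:
        raise ValueError("page not present; the IA proxy must service the fault first")
    exo_pte = EXO_VALID | (ia_pte >> PAGE_SHIFT)
    if ia_pte & IA_WRITABLE:
        exo_pte |= EXO_WRITABLE
    return exo_pte


@dataclass
class IASequencer:
    page_table: dict = field(default_factory=dict)  # virtual page -> IA PTE
    next_frame: int = 0x100

    def proxy_touch(self, vaddr: int) -> int:
        """Touch vaddr on behalf of the exo-sequencer and return the IA PTE."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.page_table:  # stands in for OS page-fault handling
            self.page_table[vpn] = (
                (self.next_frame << PAGE_SHIFT) | IA_PRESENT | IA_WRITABLE
            )
            self.next_frame += 1
        return self.page_table[vpn]


@dataclass
class ExoSequencer:
    tlb: dict = field(default_factory=dict)  # virtual page -> exo PTE

    def translate(self, vaddr: int, ia: IASequencer) -> int:
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.tlb:
            # Translation miss: the shred is suspended and the IA sequencer
            # is asked to perform proxy execution.
            ia_pte = ia.proxy_touch(vaddr)
            # ATR transcodes the IA entry before inserting it into the exo TLB.
            self.tlb[vpn] = transcode_pte(ia_pte)
        frame = self.tlb[vpn] & EXO_FRAME_MASK
        return (frame << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
```

After a miss is serviced, both translations name the same physical frame, which is the property the shared virtual address space depends on.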
The shared virtual memory space for heterogeneous sequencers provides many benefits over the alternative approaches. It provides the essential architectural foundation to extend the classic shared memory multithreaded programming paradigm to heterogeneous multi-core processors. With a shared virtual address space, shreds from a single memory image executable running on IA sequencers and exo-sequencers can perform data communication and synchronization in familiar and efficient ways, e.g., without having to resort to explicit data copying as is necessary in the loosely-coupled approach.
It is important to note that even though ATR provides the necessary architectural support for a shared virtual address space, ATR by itself does not guarantee or require cache coherence between the IA sequencer and an exo-sequencer. In the absence of hardware support for cache coherence between the IA sequencer and an exo-sequencer, it is the responsibility of the programmer to use critical sections to protect other IA shreds from reading or writing the data being processed by shreds on the exo-sequencers. When an IA shred hands off a shared data structure to a shred on an exo-sequencer to process, the IA shred must first commit any dirty lines to main memory. Similarly, when the exo-sequencer shred completes its computation, it also needs to flush its cache before releasing a semaphore to the IA sequencer.
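The software-managed handoff described above might look like the following toy model. It is a sketch, not the EXO implementation: `WriteBackCache`, `run_handoff`, and the use of Python semaphores to stand in for inter-sequencer synchronization are all assumptions made for illustration. Here `flush` both writes back dirty lines and invalidates the cache, so that a later read refills from main memory.

```python
import threading


class WriteBackCache:
    """Private, non-coherent write-back cache (toy model)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses modified since the last flush

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def read(self, addr):
        if addr not in self.lines:  # miss: fill from main memory
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def flush(self):
        """Write dirty lines back, then invalidate all cached lines."""
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()
        self.lines.clear()


def run_handoff():
    main_memory = {}
    ia_cache = WriteBackCache(main_memory)
    exo_cache = WriteBackCache(main_memory)
    to_exo = threading.Semaphore(0)
    to_ia = threading.Semaphore(0)
    result = {}

    def ia_shred():
        for i in range(4):
            ia_cache.write(i, i + 1)   # produce the shared input buffer
        ia_cache.flush()               # commit dirty lines before the handoff
        to_exo.release()               # hand the buffer to the exo shred
        to_ia.acquire()                # wait until the exo shred is done
        result["sum"] = ia_cache.read(100)

    def exo_shred():
        to_exo.acquire()
        total = sum(exo_cache.read(i) for i in range(4))
        exo_cache.write(100, total)
        exo_cache.flush()              # flush before releasing the semaphore
        to_ia.release()

    threads = [threading.Thread(target=ia_shred), threading.Thread(target=exo_shred)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["sum"]
```

Removing either `flush` call in this model leaves the consumer reading stale zeros from main memory, which is the hazard the programmer has to guard against when the hardware provides no coherence.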
Clearly, with full cache coherence support between the IA sequencer and the exo-sequencer, the programmer's work can be greatly eased. In particular, there is no need to use critical sections to ensure mutual exclusion on reads to the shared working set. This enables more concurrency between shreds on the IA sequencer and the exo-sequencer.
Collaborative Exception Handling
As with page faults, execution on the exo-sequencers could potentially incur exceptions or faults that require OS services. In conventional MISP, if an exception occurs on an application-managed sequencer, the instruction causing the exception can be replayed on the OS-managed sequencer through proxy execution. However, when the exception occurs on a non-IA exo-sequencer, the faulting instruction cannot simply be replayed on the IA CPU sequencer. Because the exo-sequencer uses a different ISA, the faulting instruction might have a data type that is not supported by IA ISA directly, or the exo-sequencer may require a different exception handling convention. To address this, EXO adds hardware support for CEH and a software-based exception handling mechanism, which allows faults or exceptions that occur on the exo-sequencer to be handled by the OS by proxy on the OS-managed IA sequencer.
Through CEH, an exception is handled in a similar fashion to a TLB miss. For example, as shown in Figure 2, when a double precision floating point vector instruction on an exo-sequencer incurs an exception, the exo-sequencer first signals the IA sequencer, as it does with ATR. The IA sequencer then functions as the proxy for the exo-sequencer, either by invoking an application-level handler to emulate the faulting vector instruction or by using an OS service such as Structured Exception Handling (SEH) to provide full IEEE-compliant handling of the exception on the particular excepting scalar element. Once the exception is handled on the IA sequencer, CEH ensures the result is updated on the exo-sequencer before resuming execution.
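As a rough illustration of this proxy pattern, the sketch below models a vector divide whose lanes fault on a zero divisor. Everything here is hypothetical: `ExoFault`, `exo_vector_div`, `ia_proxy_handler`, and `run_with_ceh` are names invented for the example, and a Python exception stands in for the inter-sequencer signal. The handler supplies the IEEE 754 default result for the single excepting element, and execution then resumes at the next lane.

```python
import math


class ExoFault(Exception):
    """Stands in for the signal an exo-sequencer sends to the IA sequencer."""

    def __init__(self, lane, a, b):
        super().__init__(f"fault in lane {lane}")
        self.lane, self.a, self.b = lane, a, b


def exo_vector_div(dst, a, b, start=0):
    """Hypothetical exo vector divide that faults on a zero divisor."""
    for lane in range(start, len(a)):
        if b[lane] == 0.0:
            raise ExoFault(lane, a[lane], b[lane])
        dst[lane] = a[lane] / b[lane]


def ia_proxy_handler(fault):
    """Compute the IEEE 754 default result of x / 0 for one scalar element."""
    if fault.a == 0.0 or math.isnan(fault.a):
        return math.nan  # 0/0 and NaN/0 are invalid operations
    sign = math.copysign(1.0, fault.a) * math.copysign(1.0, fault.b)
    return math.copysign(math.inf, sign)  # nonzero/0 gives a signed infinity


def run_with_ceh(a, b):
    dst = [0.0] * len(a)
    start = 0
    while start < len(a):
        try:
            exo_vector_div(dst, a, b, start)
            break  # the instruction completed without further faults
        except ExoFault as fault:
            # The proxy handles the excepting element and the result is
            # written back before the exo-sequencer resumes.
            dst[fault.lane] = ia_proxy_handler(fault)
            start = fault.lane + 1
    return dst
```

For inputs `[1.0, 2.0, -3.0, 0.0]` and `[2.0, 0.0, 0.0, 0.0]`, only the last three lanes take the proxy path; the first lane completes on the simulated exo-sequencer.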
Accelerator Exo-Sequencer: Two Examples
Media Accelerator
One example of an exo-sequencer accelerator is the integrated Intel® Graphics Media Accelerator X3000 from the Intel® 965G Express chipset [9]. Figure 3 shows a high-level view of the GMA X3000 hardware. The GMA X3000 contains eight programmable, general-purpose graphics media accelerator cores, called Execution Units (EU), each of which supports four hardware thread contexts. From the programmer's perspective, 32 exo-sequencers are available. We use a custom emulation firmware that uses an IA CPU core as the OS-managed sequencer and uses the 32 GMA X3000 sequencers as exo-sequencers. The firmware implements all essential architectural extensions required by the EXO architecture, including MISP exoskeleton, ATR, and CEH.
A shred for the GMA X3000 exo-sequencer can be created either by an IA shred or spawned from another GMA X3000 shred. Once created, GMA X3000 shreds are scheduled in a software work queue in shared virtual memory like POSIX threads. The work queue can have a far greater number of shreds than the number of GMA X3000 exo-sequencers. The emulation firmware is responsible for translating a shred descriptor, which includes shred continuation information like instruction and data pointers to the shared memory, into implementation-specific hardware commands that the GMA X3000 exo-sequencers can consume and execute. The emulation layer hides all device-specific hardware details from the programmer.
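The relationship between the software work queue and the 32 exo-sequencers can be sketched as follows. This is an assumption-laden simplification: the `ShredDescriptor` fields mirror the instruction and data pointers mentioned above, but the command format returned by `to_hw_command` is invented, and draining the queue in fixed waves is a stand-in for whatever scheduling policy the emulation firmware actually uses.

```python
from collections import deque
from dataclasses import dataclass

NUM_EUS = 8
THREADS_PER_EU = 4
NUM_EXO_SEQUENCERS = NUM_EUS * THREADS_PER_EU  # 32 hardware thread contexts


@dataclass(frozen=True)
class ShredDescriptor:
    instruction_ptr: int  # where the shred's code lives in shared memory
    data_ptr: int         # where the shred's data lives in shared memory


def to_hw_command(shred, slot):
    """Translate a descriptor into a made-up device command for one slot."""
    eu, thread = divmod(slot, THREADS_PER_EU)
    return {
        "eu": eu,
        "thread": thread,
        "ip": shred.instruction_ptr,
        "data": shred.data_ptr,
    }


def dispatch(work_queue):
    """Drain the queue in waves of at most NUM_EXO_SEQUENCERS shreds."""
    waves = []
    while work_queue:
        wave = []
        for slot in range(NUM_EXO_SEQUENCERS):
            if not work_queue:
                break
            wave.append(to_hw_command(work_queue.popleft(), slot))
        waves.append(wave)
    return waves
```

A queue holding 70 descriptors would be drained in three waves of 32, 32, and 6 shreds, illustrating that the queue depth is independent of the number of hardware contexts.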

Figure 3: High-level view of the Intel GMA X3000
Communication Accelerator
Another example of the exo-sequencer accelerator is the Scalable Communication Cores (SCC) [8]. SCC is a research prototype designed for a reconfigurable radio baseband that is capable of processing several wireless standard protocols, such as WiFi, WiMax [12], or cellular infrastructure, with a common set of hardware. The SCC system architecture consists of a heterogeneous set of coarse-grained, highly optimized baseband Processing Elements (PEs).
One type of PE is the Data Processing Element (DPE) core, which performs computationally intensive operations, such as the Fast Fourier Transform (FFT) that is commonly used in many standard protocols. The DPE core structure consists of control and computation units and several memory blocks. DPE cores are connected via flexible interconnect matrices. Asynchronous data-path swap units support commutations from any of four inputs to any of four outputs. Reconfiguration of the data-path can be done dynamically with interconnection information and operation parameters stored in the configuration cache.
Inside DPE, there is a configuration (CFG) queue that is part of a special task scheduling mechanism. Each task pointer that is pushed onto the CFG queue will be fetched by the core engine. Each launched task becomes an exo-sequencer running on DPE. The DPE can be configured to use multiple CFG queues, thus implying a multi-threaded implementation. This allows multiple exo-sequencers to run concurrently on the DPE engine.
