Tera-scale Computing
Accelerator Exoskeleton
EXOCHI PROTOTYPE
The EXOCHI framework described in this paper has already been deployed within Intel® to develop production-quality GMA X3000 media-processing kernels and other workloads of growing importance [2]. Figures 8 and 9 show how an IA look-n-feel allows familiar development tools and environments to be used in writing heterogeneous multi-shredded code. Figure 8 shows the use of familiar legacy development tools (Microsoft Visual Studio*) for development and debugging of heterogeneous multi-shredded code. Figure 9 illustrates the compilation and execution of such a program.

Figure 8: IA Look-n-Feel IDE (Microsoft Visual Studio) for application development

Figure 9: IA Look-n-Feel compilation and execution
Performance Evaluation
To evaluate the performance of our EXOCHI prototype, we select a representative subset of the kernels that have been developed. These kernels exhibit a significant amount of data- and thread-level parallelism and thus readily lend themselves to efficient execution on the GMA X3000 exo-sequencers.
Implementing these kernels is simplified by special GMA X3000 ISA features optimized for media processing. The key ISA features include wide SIMD instructions, predication support, and a large register file of 64 to 128 vector registers for each GMA X3000 exo-sequencer. With CHI, programmers can use the GMA X3000 ISA features directly via inline assembly in C/C++ code, as if they were traditional ISA extensions to IA, such as SSE. By providing such an IA look-n-feel, CHI enables highly productive development of heterogeneous multi-shredded code.
All benchmarks are compiled with the enhanced version of the Intel® C++ Compiler using the most aggressive optimization settings (fast Qprof_use). These compiler optimizations include auto-vectorization, profile-guided optimization, and tuning specifically for the Intel® Core™2 Duo processor used in the EXO prototype system. LinearFilter, SepiaTone, and FGT make use of the optimized and SSE-enhanced Intel IPP library; the other benchmarks were manually tuned and SSE-optimized. Performance results measure wall-clock execution time.

Figure 10: Speedup from execution on GMA X3000 exo-sequencers over IA sequencer
Performance Speedup on GMA X3000 Exo-sequencers over IA Sequencer
Figure 10 shows the speedup achieved over IA sequencer execution by executing media kernels on the GMA X3000 exo-sequencers. Significant speedup is achieved, ranging from 1.41X for BOB up to 10.97X for Bicubic. Two factors are crucial in achieving this high throughput on the GMA X3000 exo-sequencers. Most important is the availability of abundant shred-level parallelism. As each GMA X3000 exo-sequencer supports only in-order execution within a shred, the accelerator relies on the presence of multiple concurrent shreds to hide stalls incurred in one shred by switching to another shred. A second, related issue is the need to maximize cache hit rate and memory bandwidth utilization. The GMA X3000 supports simultaneous execution of 32 hardware threads, each of which might be reading and writing multiple data streams. The CHI runtime allows programmers to orchestrate shred scheduling so that shreds accessing adjacent or overlapping macroblocks are ordered closely together in the work queue, taking advantage of spatial and temporal locality.
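The locality-aware ordering described above can be illustrated with a minimal sketch. It is not the CHI runtime's scheduling interface, which is not shown in this article; the function name `tiled_shred_order`, the tile size, and the 720x480 frame are hypothetical choices made only for illustration. The sketch enqueues one work item per macroblock, visiting the frame tile by tile so that neighboring macroblocks sit next to each other in the queue.

```python
def tiled_shred_order(mb_cols, mb_rows, tile=4):
    """Return macroblock coordinates (col, row) in tile-by-tile order.

    Macroblocks inside the same tile x tile square are adjacent in the
    returned list, so shreds that touch neighboring pixels are enqueued
    close together. (Hypothetical helper, not part of CHI.)
    """
    order = []
    for tile_row in range(0, mb_rows, tile):
        for tile_col in range(0, mb_cols, tile):
            # Walk every macroblock of this tile before moving to the next tile.
            for row in range(tile_row, min(tile_row + tile, mb_rows)):
                for col in range(tile_col, min(tile_col + tile, mb_cols)):
                    order.append((col, row))
    return order


# Hypothetical frame: 720x480 pixels split into 16x16-pixel macroblocks.
work_queue = tiled_shred_order(mb_cols=720 // 16, mb_rows=480 // 16)
```

A plain row-major order would place vertically adjacent macroblocks a full row apart in the queue; the tiled order keeps them within one tile's worth of entries, which is one simple way to favor cache reuse among concurrently running shreds.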
In addition to supporting thread-level parallelism, the GMA X3000 ISA provides strong support for data-level parallelism. It features significantly wider SIMD operations (8- to 16-wide vector) than the SSE on today's IA CPU.

Figure 11: Impact of shared virtual memory
Impact of Data Copying Versus Shared Virtual Address Space
In general, the performance improvement achieved by using an accelerator is determined not only by the accelerator architecture but also by the overhead of data communication between the CPU and accelerator. This overhead varies greatly depending on the memory model between the CPU and the accelerator. Figure 11 shows overall performance improvement achieved with a cache coherent shared virtual memory model between the IA sequencer and the GMA X3000 exo-sequencers. In the absence of cache coherence or shared memory, the data communication overhead can significantly degrade the speedup achieved from accelerating the computation. In Figure 11 we contrast performance impacts for three memory model configurations.
The first configuration, Data Copy, assumes a model with neither shared virtual memory nor cache coherence between the IA sequencer and the GMA X3000 exo-sequencers. Consequently, data communication between the IA shred and the GMA X3000 shreds requires explicit data copying, for which we assume a 3.1GB/s data copy rate. This corresponds to an aggressive data copy rate using an SSE-enhanced memory copy routine when copying data from a cacheable memory source to a destination region marked as uncacheable, write-combining memory. The Intel Core 2 Duo processor features special write-combining buffers that allow aggressive burst mode transfers when copying from cacheable memory to write-combining memory. Due to the lack of shared virtual memory, the inter-shred communication between the IA shred and the GMA X3000 shreds resembles traditional message-passing communication between processes in different address spaces.
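To see why the copy rate matters, consider a back-of-the-envelope model of the Data Copy configuration. This is a sketch under stated assumptions, not the methodology behind Figure 11: it assumes copying and computation do not overlap, it treats 3.1GB/s as 3.1e9 bytes per second, and the function name and the example timings are hypothetical.

```python
COPY_RATE_BYTES_PER_S = 3.1e9  # copy rate assumed in the text, read as decimal GB/s


def speedup_with_copy(ia_time_s, exo_time_s, bytes_copied,
                      copy_rate=COPY_RATE_BYTES_PER_S):
    """Speedup over the IA sequencer when data must be copied explicitly.

    Assumes the copy and the exo-sequencer computation run back to back
    with no overlap (a simplifying assumption, not a measured behavior).
    """
    copy_time_s = bytes_copied / copy_rate
    return ia_time_s / (exo_time_s + copy_time_s)


# Hypothetical kernel: 10 ms on the IA sequencer, 2 ms on the exo-sequencers,
# with 6.2 MB copied in total between the two address spaces.
no_copy = speedup_with_copy(0.010, 0.002, bytes_copied=0)
with_copy = speedup_with_copy(0.010, 0.002, bytes_copied=6.2e6)
```

With these made-up numbers the copy takes about as long as the accelerated computation itself, so the speedup falls from roughly 5x to roughly 2.5x. Kernels that do little work per byte are hit hardest under this model, because the copy term dominates the denominator.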
The second configuration, Non-CC Shared, assumes a shared virtual address space but without cache coherency between the IA sequencer and the GMA X3000 exo-sequencers. Data copying can be avoided in this case, as both the IA sequencer and the GMA X3000 exo-sequencers can access the identical physical memory location for the same virtual address. Memory writes performed by the IA sequencer or the GMA X3000 exo-sequencers may not be visible to the other until after a cache flush operation, which forces any dirty cache lines to be written back to main memory. However, data communication can still be accomplished by passing a pointer to a shared data structure between the IA sequencer and a GMA X3000 exo-sequencer, as long as cache flush operations are appropriately invoked. Due to the lack of cache coherence, the IA shred and the GMA X3000 shreds need to use critical sections to enforce mutually exclusive access to shared data structures. The semaphore on the critical section will not be released until the GMA X3000 exo-sequencers completely flush the dirty lines to memory.
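The visibility rule in the Non-CC Shared configuration can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of the actual hardware: the class `WriteBackCache` is hypothetical, it tracks only dirty lines, and it ignores stale clean lines that a real non-coherent reader would also have to invalidate.

```python
class WriteBackCache:
    """Toy private write-back cache that tracks only dirty lines."""

    def __init__(self, memory):
        self.memory = memory      # shared backing store: address -> value
        self.dirty = {}           # lines written locally, not yet flushed

    def write(self, addr, value):
        self.dirty[addr] = value  # stays private until flush()

    def read(self, addr):
        if addr in self.dirty:
            return self.dirty[addr]
        return self.memory.get(addr)  # None if nothing was ever written back

    def flush(self):
        # Write every dirty line back to shared memory, then drop it locally.
        self.memory.update(self.dirty)
        self.dirty.clear()


memory = {}                           # one shared virtual address space
ia_cache = WriteBackCache(memory)     # stands in for the IA sequencer's cache
exo_cache = WriteBackCache(memory)    # stands in for an exo-sequencer's cache

ia_cache.write(0x1000, "frame data")
before_flush = exo_cache.read(0x1000)   # None: the write is not visible yet
ia_cache.flush()
after_flush = exo_cache.read(0x1000)    # "frame data": visible after the flush
```

Both sides use the same address, so no copy is needed, but the consumer sees the producer's data only after the flush. This is why the flush has to complete before the semaphore guarding the shared structure is released.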
The third configuration, CC Shared, models a cache-coherent shared virtual address space, which is the configuration assumed in Figure 10. In this model, data communication between the IA shred and the GMA X3000 shreds becomes much more efficient. Synchronizing access to shared data structures also becomes much easier for programmers. For example, while critical sections are still necessary to provide mutual exclusion on writes to a shared variable, one shred can always read the shared variables that are updated by the other shreds. This allows more execution concurrency between shreds.
The performance data in Figure 11 demonstrate the benefits of a shared virtual address space compared to data copying. Significant performance improvement is still possible with data copying for computationally intensive kernels (e.g., Bicubic and ADVDI), but the gains are significantly reduced relative to the CC Shared configuration in cases such as LinearFilter and BOB. For benchmarks in which the GMA X3000 performs little computation on the loaded input data, the time to copy data between separate address spaces represents a significant fraction of the processing time. Even with a highly optimized implementation on the latest Intel Core 2 Duo processor, the Data Copy configuration achieves only 70.5% of the performance seen with a coherent shared virtual address space.
The cost of copying data can be avoided if the IA sequencer and the GMA X3000 exo-sequencers operate within a shared virtual address space, even if cache coherency is not supported. The time required to flush caches is still nontrivial, however, and the Non-CC Shared configuration yields 85.3% of the performance achieved with full cache coherency. Support for cache coherence improves performance because the cache flush operation is not needed to synchronize memory accesses.
For the Non-CC Shared configuration, when an IA shred spawns GMA X3000 shreds, it may appear necessary to flush the IA sequencer's cache fully before any GMA X3000 shred can be launched. In reality, most of the cache flush operation on the IA sequencer can be overlapped with shred execution on the GMA X3000 exo-sequencers if cache flush operations and shred launches are interleaved. Each exo-sequencer shred reads and writes only a tiny portion of each data buffer (e.g., a 16 pixel by 16 pixel macroblock). As soon as those data have been flushed back to memory by the IA producer shred, the exo-sequencer consumer shred for that macroblock can be launched and can execute safely. Additional cache flush operations can then proceed in parallel with useful work on the exo-sequencers.
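The benefit of interleaving flushes with shred launches can be estimated with a small timing model. This is a sketch under simplifying assumptions rather than a model of the real runtime: every macroblock is assumed to take the same time to flush and the same time to process, flushes are assumed to happen one after another on the IA sequencer, and the function names and example values are hypothetical.

```python
import heapq


def makespan(ready_times, exec_time, slots):
    """Finish time when each shred starts once its data is ready and a slot is free.

    ready_times must be in non-decreasing order.
    """
    free_at = [0.0] * slots          # min-heap: time at which each slot frees up
    heapq.heapify(free_at)
    finish = 0.0
    for ready in ready_times:
        start = max(ready, heapq.heappop(free_at))
        end = start + exec_time
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


def flush_then_launch(n_blocks, flush_time, exec_time, slots):
    """Flush every macroblock first, then launch all consumer shreds."""
    all_flushed = n_blocks * flush_time
    return makespan([all_flushed] * n_blocks, exec_time, slots)


def interleaved(n_blocks, flush_time, exec_time, slots):
    """Launch each consumer shred as soon as its own macroblock is flushed."""
    ready = [(i + 1) * flush_time for i in range(n_blocks)]
    return makespan(ready, exec_time, slots)
```

For example, with 8 macroblocks, a flush time of 1 unit, an execution time of 4 units, and 4 execution slots, `flush_then_launch(8, 1, 4, 4)` gives 16 units while `interleaved(8, 1, 4, 4)` gives 12, because all but the first flush overlap with shred execution.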