Tera-scale Computing
Accelerator Exoskeleton
CHI PROGRAMMING ENVIRONMENT
C for Heterogeneous Integration (CHI) is designed to provide an IA look-and-feel programming environment that supports user-level multi-shredding on heterogeneous sequencers. In the CHI infrastructure, we enhance the Intel® C++ Compiler to support accelerator-specific inline assembly within the C/C++ source. In addition, we extend OpenMP pragmas to support heterogeneous multi-shredding and provide the related runtime support. The runtime library is responsible for judiciously scheduling heterogeneous shreds across the exo-sequencers. The compiler can also embed debugging information for different ISAs in a single binary. An enhanced version of the Intel Debugger (IDB) can use this information to enable source-level debugging of both the C/C++ code on the IA CPU target and the accelerator-specific code on the accelerator target. Figure 4 depicts the overall CHI compilation infrastructure. The CHI compiler provides three new capabilities that allow programmers to express multi-shredded computation for the heterogeneous exo-sequencers in the C/C++ source code:
- A method to specify a region of accelerator-specific computation in either inline assembly or domain-specific language.
- A method to specify fork-join or producer-consumer style shred-level parallel execution for the inline accelerator-specific code region with OpenMP pragmas.
- A method to specify input and output memory regions and live-in values for the accelerator-specific code region.
Inline Accelerator Assembly Support
C/C++ provides a facility to inline assembly code blocks directly within the high-level source code. This capability gives programmers access to new instructions or processor features not exposed through the compiler and allows the most performance-critical parts of a program to be custom optimized in assembly. This inline assembly construct can be naturally extended to provide accelerator-specific inline assembly support.
Many variants of the asm keyword and syntax exist. In CHI we adopt the Microsoft MASM syntax, i.e.,
__asm {asm_statements;}
where brackets are used to enclose the assembly statements. __asm is the keyword that indicates the enclosed block of code is a special assembly block written specifically for the given accelerator ISA. The asm_statements enclosed in the ensuing brackets are compiled into an accelerator-specific executable binary. The target ISA for the asm_statements is specified through the enclosing OpenMP pragma with the target clause, which is described in this paper in the section entitled "OpenMP Parallel Pragma Extension." As shown in Figure 4, a separate accelerator-specific assembler is dynamically linked with the Intel compiler. Figure 5 shows an example of C code using the extended OpenMP pragmas and CHI runtime APIs for a heterogeneous target consisting of an IA32 sequencer and GMA X3000 exo-sequencers.

Figure 4: CHI compilation flow
Similar to traditional inline assembly, this accelerator-specific assembler generates code for the target ISA by translating the inline assembly instructions enclosed in the brackets into binary code and resolving symbolic names for memory locations and other entities referenced within the assembly block. After the assembler compiles the assembly block, the resulting binary code is embedded in a special code section of the executable indexed with a unique identifier. The final executable is a fat binary, consisting of binary code sections corresponding to different ISAs.
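The fat-binary layout described above can be pictured as a table of code sections, each tagged with its ISA and looked up by its unique identifier. The following is a minimal sketch under that assumption only; the type names, field names, and `find_section` are hypothetical and are not taken from the CHI toolchain, whose actual section format is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ISA tags; the names are illustrative, not CHI's. */
typedef enum { ISA_IA32, ISA_GMA_X3000 } isa_t;

/* One embedded code section: binary code for one ISA, keyed by a unique id. */
typedef struct {
    uint32_t       id;    /* unique identifier assigned when the block is assembled */
    isa_t          isa;   /* ISA the bytes were generated for */
    const uint8_t *code;  /* start of the binary code */
    size_t         size;  /* length of the code in bytes */
} code_section;

/* A fat binary modeled as a flat table of sections (an assumption). */
typedef struct {
    const code_section *sections;
    size_t              count;
} fat_binary;

/* Return the section with the given id, or NULL if there is none. */
const code_section *find_section(const fat_binary *fb, uint32_t id)
{
    assert(fb != NULL);
    for (size_t i = 0; i < fb->count; i++) {
        if (fb->sections[i].id == id)
            return &fb->sections[i];
    }
    return NULL;
}
```

In this model, a runtime that is about to dispatch a shred would call `find_section` with the identifier recorded for the parallel region and hand the returned bytes to the matching exo-sequencer.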
Domain-specific Language Support
In addition to supporting accelerator-specific inline assembly, the C/C++ compiler can be further extended to provide a facility to inline domain-specific language blocks directly within the high-level source code. These domain-specific languages are designed to utilize accelerator-specific features not exposed through the general C/C++ programming environment. Programmers can therefore take advantage of the full capability of the underlying accelerators without programming the exo-sequencer directly in assembly language.

Figure 5: Example GMA X3000 inline assembly
To provide a uniform programming interface to programmers, we adopt a format similar to that of the asm syntax, i.e.,
__<language keyword>{domain-specific language statements;}
where brackets are used to enclose the domain-specific language statements. __<language keyword> can be the keyword of any language that is supported by CHI. Upon parsing the particular language keyword, the C/C++ compiler invokes the corresponding domain-specific compiler plug-in to generate the accelerator-specific binary, similar to how it is done with the inline assembly support as described in the section entitled "Inline Accelerator Assembly Support."
Figure 6 shows an example of the domain-specific language support for the Data-stream Programming Language (DPL), which is specifically designed for the retargetable SCC-DPE accelerator. DPL provides essential high-level functions to exploit the inner microarchitecture of the DPE systolic arrays. Programmers can embed DPL code within the brackets preceded by the __dpl keyword.
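One way to picture the keyword-driven dispatch is a lookup table that maps each language keyword to a plug-in entry point. The sketch below assumes such a table; `plugin_fn`, `dispatch_block`, the `TARGET_*` tags, and the stub plug-ins are all hypothetical, and the stubs only return a tag rather than generate code. Only the `__asm` and `__dpl` keywords come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical plug-in entry point: takes the text of the enclosed block
   and returns a tag naming the back end that handled it. */
typedef int (*plugin_fn)(const char *block_source);

enum { TARGET_NONE = 0, TARGET_ASM = 1, TARGET_DPL = 2 };

/* Stub plug-ins: a real one would emit accelerator-specific binary code. */
static int compile_asm(const char *src) { (void)src; return TARGET_ASM; }
static int compile_dpl(const char *src) { (void)src; return TARGET_DPL; }

typedef struct {
    const char *keyword;  /* language keyword as written in the source */
    plugin_fn   compile;  /* plug-in invoked for that keyword */
} plugin_entry;

static const plugin_entry plugins[] = {
    { "__asm", compile_asm },
    { "__dpl", compile_dpl },
};

/* Find the plug-in registered for the keyword and run it on the block.
   Returns TARGET_NONE when no plug-in is registered for the keyword. */
int dispatch_block(const char *keyword, const char *block_source)
{
    assert(keyword != NULL);
    for (size_t i = 0; i < sizeof plugins / sizeof plugins[0]; i++) {
        if (strcmp(plugins[i].keyword, keyword) == 0)
            return plugins[i].compile(block_source);
    }
    return TARGET_NONE;
}
```

Under this assumption, adding a language amounts to adding one table entry, which matches the uniform `__<language keyword>{...}` form presented to the programmer.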
OpenMP Parallel Pragma Extension
CHI extends the OpenMP parallel pragma. The construct for generating heterogeneous shreds of an accelerator-specific instruction set is outlined in Figure 7(a). The target clause specifies the particular accelerator instruction set used within the parallel region. The compiler inserts appropriate calls to the CHI runtime layer to enable judicious dynamic shred scheduling and dispatching onto the targeted exo-sequencers. When the main IA shred encounters an accelerator-specific parallel construct with the target(targetISA) clause, the IA shred spawns a team of num_threads heterogeneous shreds for the parallel region, where each shred eventually executes the enclosed assembly block on an exo-sequencer.

Figure 6: Example inline DPL code using CHI
By default, the main IA shred waits at the end of the construct until it is notified by the CHI runtime of the completion of all heterogeneous shreds. Similar to the traditional nowait clause, an optional master_nowait clause allows the main IA shred to continue execution past the construct after spawning the team of heterogeneous shreds, without having to wait for their completion. This allows concurrent execution on both the IA sequencer and its exo-sequencers. The CHI runtime is responsible for asynchronously notifying the IA sequencer of the eventual completion of all heterogeneous shreds.
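The difference between the default fork-join behavior and master_nowait can be modeled in a few lines of plain C. This is a single-threaded simulation written for illustration, not the CHI runtime: `shred_team`, `team_spawn`, `team_step`, and `team_wait` are hypothetical names, the shreds run one at a time when the caller steps the team, and the `done` flag stands in for the runtime's completion notification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Work executed by each shred; shred_id runs from 0 to num_shreds - 1. */
typedef void (*shred_fn)(int shred_id, void *arg);

typedef struct {
    shred_fn fn;
    void    *arg;
    int      num_shreds;  /* team size, as given by num_threads */
    int      next;        /* next shred still to be run */
    bool     done;        /* models the completion notification */
} shred_team;

/* Run one pending shred. Returns false when no shred was left to run. */
bool team_step(shred_team *t)
{
    if (t->next >= t->num_shreds) {
        t->done = true;
        return false;
    }
    t->fn(t->next, t->arg);
    t->next++;
    if (t->next == t->num_shreds)
        t->done = true;
    return true;
}

/* Block until every shred in the team has completed. */
void team_wait(shred_team *t)
{
    while (team_step(t)) {
        /* keep running shreds until none remain */
    }
}

/* Spawn a team. Without master_nowait the caller returns only after all
   shreds finish; with it, the caller returns at once and the shreds
   complete later (here, when the caller steps or waits on the team). */
shred_team team_spawn(shred_fn fn, void *arg, int num_shreds, bool master_nowait)
{
    assert(fn != NULL && num_shreds > 0);
    shred_team t = { fn, arg, num_shreds, 0, false };
    if (!master_nowait)
        team_wait(&t);
    return t;
}

/* Example shred body: add the shred id into a shared accumulator. */
void add_shred_id(int shred_id, void *arg)
{
    *(int *)arg += shred_id;
}
```

With `master_nowait` set, the caller is free to do its own work between `team_spawn` and `team_wait`, which is the overlap between the IA sequencer and the exo-sequencers that the clause is meant to enable. In the real system the shreds progress on their own hardware rather than being stepped by the caller.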
OpenMP Work-Queuing Extension
In order to support concurrent threads with intricate dynamic inter-thread dependencies (e.g., due to the use of irregular data structures), the Intel C++ Compiler supports irregular parallelism through two special OpenMP pragmas, taskq and task [23]. In CHI, we further enhance the compiler and runtime to support inter-shred dependencies among heterogeneous shreds using these pragmas. The parallel taskq construct and the task construct for an exo-sequencer are outlined in Figure 7(b) and Figure 7(c).
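The work-queuing model behind taskq and task can be sketched with an explicit queue: one pass walks an irregular structure (here a linked list) and enqueues a task per node, and the queued tasks are then executed. The code below is a sequential illustration of that idea only. The names `task_queue`, `taskq_enqueue`, `taskq_drain`, and `process_list` are hypothetical, the fixed capacity is an arbitrary simplification, and the real runtime would hand queued tasks to shreds on exo-sequencers rather than run them inline.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 16  /* arbitrary capacity chosen for this sketch */

typedef struct node {
    int          value;
    struct node *next;
} node;

typedef void (*task_fn)(node *n);

typedef struct {
    task_fn fn;
    node   *arg;
} task;

typedef struct {
    task   items[QUEUE_CAP];
    size_t head;  /* index of the next task to run */
    size_t tail;  /* index where the next task is stored */
} task_queue;

/* Add a task to the queue. Returns 0 on success, -1 if the queue is full. */
int taskq_enqueue(task_queue *q, task_fn fn, node *arg)
{
    assert(q != NULL && fn != NULL);
    if (q->tail == QUEUE_CAP)
        return -1;
    q->items[q->tail].fn = fn;
    q->items[q->tail].arg = arg;
    q->tail++;
    return 0;
}

/* Run every queued task in FIFO order. Returns the number of tasks run. */
size_t taskq_drain(task_queue *q)
{
    size_t ran = 0;
    while (q->head < q->tail) {
        task t = q->items[q->head++];
        t.fn(t.arg);
        ran++;
    }
    return ran;
}

/* Example task body: double the value stored in one list node. */
void double_value(node *n)
{
    n->value *= 2;
}

/* Analog of a taskq region: traverse the list, queue one task per node,
   then execute the queued tasks. */
size_t process_list(node *first)
{
    task_queue q = { .head = 0, .tail = 0 };
    for (node *n = first; n != NULL; n = n->next) {
        if (taskq_enqueue(&q, double_value, n) != 0)
            break;  /* queue full: stop enqueuing in this sketch */
    }
    return taskq_drain(&q);
}
```

Separating the traversal (which discovers work) from the task bodies (which perform it) is what lets the amount of parallel work be determined at run time instead of by a loop bound.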

Figure 7: CHI extensions to OpenMP pragmas
CHI Runtime Support
The CHI runtime is a software library that translates the programmer-specified OpenMP directives into primitives to create and manage shreds that can carry out parallel execution on the heterogeneous multi-core target. Like conventional OpenMP runtimes, the CHI runtime layer provides a layer of abstraction that hides the details of managing the exo-sequencers from the programmer.
In order to allow the accelerator more efficient access to the C/C++ variables specified by the shared data clause, programmers can use the CHI runtime APIs to convey accelerator-specific access information through data structures known as descriptors. Descriptors are used by the accelerator to interpret the attributes of the shared variables that are accessed by the shreds.
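As an illustration of what a descriptor might carry, the sketch below describes a shared C array as a two-dimensional surface with a base address, dimensions, element size, and an access flag. All of it is an assumption made for the example: `surface_desc`, `describe_surface`, `surface_offset`, and `surface_bytes` are hypothetical and are not the CHI runtime APIs, whose actual names and fields are not given in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: tells an accelerator how to interpret a shared
   C variable as a row-major 2D surface. */
typedef struct {
    void    *base;       /* address of the shared variable */
    uint32_t width;      /* elements per row */
    uint32_t height;     /* number of rows */
    uint32_t elem_size;  /* bytes per element */
    int      writable;   /* nonzero if shreds may write to the surface */
} surface_desc;

/* Build a descriptor for a shared array (illustrative helper). */
surface_desc describe_surface(void *base, uint32_t width, uint32_t height,
                              uint32_t elem_size, int writable)
{
    assert(base != NULL && width > 0 && height > 0 && elem_size > 0);
    surface_desc d = { base, width, height, elem_size, writable };
    return d;
}

/* Byte offset of element (x, y) from the base address. */
size_t surface_offset(const surface_desc *d, uint32_t x, uint32_t y)
{
    assert(d != NULL && x < d->width && y < d->height);
    return ((size_t)y * d->width + x) * d->elem_size;
}

/* Total size of the described region in bytes. */
size_t surface_bytes(const surface_desc *d)
{
    assert(d != NULL);
    return (size_t)d->width * d->height * d->elem_size;
}
```

For a `float image[4][8]` shared with the shreds, `describe_surface(image, 8, 4, sizeof(float), 1)` would describe a 128-byte writable region, and `surface_offset` gives the byte offset a shred would use to reach a given element.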