Technology & Research

Intel® Technology Journal Home

Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.07

  • Volume 11
  • Issue 04
  • Published November 15, 2007

Multi-Core Software

  Section 7 of 12  

Future-Proof Data Parallel Algorithms and Software on Intel® Multi-Core Architecture

IMPLEMENTING CT FOR FORWARD-SCALING

Figure 2 also illustrates the Ct execution model. The core of Ct-enabled applications is the use of Ct-based data structures and algorithms. In addition, Ct Application Libraries are a set of well-optimized higher-level APIs aiming to boost programmers' productivity for Tera-scale applications such as image processing, linear algebra, and physics simulation. The Ct libraries can be compiled by stock C++ compilers, such as Visual C++, Intel® C/C++, and Gnu C/C++ compilers, into an IA binary that is able to run on all multi-core IA platforms. This binary comprises of either IA32 or Intel® 64 Architecture instructions, which also include calls to the Ct Dynamic Engine. During the execution of the binary, the Ct Dynamic Engine is launched and provides the services essential to performance and forward-scaling. More specifically, the three major services are the Threading Runtime (TRT), Memory Manager (MM), and Just-In-Time (JIT) compiler. In particular, the TRT and JIT (especially the vector abstraction we will introduce called VIP) provide the basis for forward scaling across IA.

The Threading Runtime

The first key to forward-scaling is to dynamically adapt to new architectural characteristics. Threading and synchronization overhead is likely to change between processor generations, necessitating an ability to both select the task granularity and synchronization method dynamically. In fact, our approach is to isolate the architecture-dependent components of the Ct runtime to dynamically loaded libraries. Another aspect of forward-scaling is that data set sizes are likely to scale in the long run, but are generally unpredictable in phases of computation, especially for client applications such as games. In this case, the amount of data being processed is highly scene and gameplay dependent. As such, the runtime must be able to adapt its threading strategy to variable data sets.

The TRT provides a fine-grained threading model that is used to implement both data parallel and task parallel constructs. The underlying building block for this model is a future, which under the runtime semantics may represent a suspended closure or thunk (i.e., a function pointer and an argument list representing a potentially parallel function application), a thunk computation in flight, or a computed value (representing a successfully evaluated thunk). A handle to a future essentially represents a dependency on that suspended thunk's evaluation. This is inspired by the techniques for expressing and managing parallelism presented in [7][8]. Using this mechanism, many complex fine-grained synchronization patterns may be expressed; however, the TRT facilitates fine-grained synchronizations via a building block called a join. A join can be used to express a range of logical combinations of synchronization dependencies.

The TRT uses additional primitives called bulkspawns and bulkjoins, which essentially represent mapped future spawns and joins on collection-oriented arguments. Bulkspawn operations dynamically partition the collection into the right number of fine-grained tasks interlinked with fine-grained synchronizations. This is key to adapting to the core count and utilization, as well as cache footprint.

The Memory Manager

The Ct MM automatically manages the segregated Ct vector space. As such, it provides a set of lock-free memory allocation interfaces, as well as a reference-counting-based garbage collector to reclaim dead vectors automatically. The MM is responsible for allocated data format and, in conjunction with the TRT, partitions vectors for parallel operations (i.e., the TRT bulkspawn operations).

The Compiler

The Ct compiler has a slightly unconventional structure, notably in its dynamic nature. When executing Ct API calls, the dynamic engine constructs intermediate representations of the computation, deferring actual execution (and further optimization) until later. "Later" is bounded by the necessity to copy values back into native C/C++ space, though the engine may decide to compile code at intermediate steps, such as when back edges in control flow (i.e., loops) are detected. This intermediate representation (IR) building is the default mode of Ct code execution for new paths in the program. Otherwise, cached code is executed if the path followed is in the "Code Cache," or a cached IR is augmented.

Once the compiler is invoked, several phases with distinct objectives are invoked: the High-Level Optimizer (HLO), the Low-Level Optimizer (LLO), and the VIP Code Generator (VCG).

The HLO phase [5] performs architecture- and runtime-independent optimizations, such as sub-primitive decomposition (breaking up data parallel operators into more primitive patterns of parallelism), fusion (essential to coarsening the fine-grained concurrency of the Ct model as much as possible), scalarization, common sub-expression elimination, and copy propagation. These optimizations are all possible without introducing the details of run-time memory allocation and shape checks. This is left to LLO.

The LLO phase is still architecture independent, but unlike HLO, LLO does runtime-dependent optimizations. It has three primary objectives: 1) generate parallelized kernels using the TRT; 2) translate the optimized kernels (spawned in the TRT) into proper vectorized code; 3) generate architecture-independent representations, which we call the Virtual Intel® Platform, or VIP. Much of the difficulty of implementing and optimizing collective communication [1] and segmented operators [2] is deferred to this phase. VIP is an abstract instruction set that is based on IA32/Intel 64 Architecture, but that uses a generalized vector ISA to defer binding to a particular generation of SSE as late as possible.

One of the challenges we face for forward-scaling is the near certainty of SSE extensions and enhancements. Figure 3 shows the evolution of Intel SIMD ISA. The number of instructions has been increasing at the pace of 30 instructions per year on average since the introduction of MMX™ technology in 1996. Meanwhile, the data width of vector registers was also increased from 64 bits to 128 bits, and it can be reasonably expected to increase further at some point in the future. This is driven by the need to increase performance in the most power efficient way and extending SIMD ISA is one such mechanism. VIP, as a virtual ISA of the Ct Dynamic Engine, is designed to hide changes in SIMD ISA, and, via VCG, provide future-proof performance to Ct applications.



Figure 3: The evolution of IA SIMD ISA vs. Ct
click image for larger view
 

The VCG phase is a state-of-the-art backend for VIP that dynamically selects the appropriate target ISA. When new SSE extensions are introduced, a new dynamically linked library can be made available that supports both legacy and new SSE extensions. No recompilation with a static compiler is necessary. In this way, applications can forward scale through vector ISA with the adaptivity of the VCG backend. VCG does classic loop-based optimizations, such as loop fusion, loop interchange, and array contraction. VCG also does architecture-dependent optimizations, such as register allocation and instruction scheduling.

By carefully layering architecture and run-time dependent optimizations in the Ct compiler, we can retarget the entire dynamic engine with great agility, including for the purposes of evaluating new micro-architectures (i.e., considering in-order architectures and evaluating new, throughput-oriented ISA extensions). This was done deliberately with forward-scaling in mind.

The dynamic compilation approach, especially the VIP layer, provides smooth migration paths to future SSE and IA-based SIMD instruction sets.

  Section 7 of 12  

Back to Top

In this article

Download a PDF of this article.