Technology & Research

Intel® Technology Journal Home

Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.01

  • Volume 11
  • Issue 04
  • Published November 15, 2007

Multi-Core Software

  Section 4 of 12  

Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors

REVAMPING THE THREADIZER

In this section, we present our new threadizer framework that is highly integrated with our classical high-level loop optimizations, and we describe its main components. The strengths of the new threadizer include the following:

  • A new Abstract Thread Representation (ATR), based on the concept of virtual threads, is designed to bridge the semantic gap between high-level representation and physical (hardware or OS) threads.
  • Better interaction with other high-level loop-related optimizations gives better performance.
  • The new threadizer is moved downstream to take advantage of scalar optimizations such as global constant propagation and Single-Static-Assignment (SSA) PRE, and some loop optimizations.
  • A table-driven cost model simplifies maintenance and future extensibility.
  • Effective runtime threadization control and multiple schedule types such as static, dynamic, guided, and runtime are supported.

The threadizer in the Intel® compiler serves as a single module that covers different languages (C++ and Fortran), architectures (IA-32, Intel® 64, and IA-64), and operating systems (Microsoft Windows*, Linux*, and MacOS*).

Virtual Threads

Our new threadization framework is based on the concept of virtual threads. A virtual thread is an abstraction above physical threads provided by hardware threads or OS threads. Virtual threads can consist of arbitrary code blocks and have no nesting-level constraints as long as they obey the specified program execution order.

A virtual thread denoted as a quadruple V(a, s, e, d) corresponds to a thread with instruction entry s, instruction exit e, data environment d, and thread id a that are assigned at runtime. An important property of a virtual thread is its lexical nesting level, which is denoted as depth(V(a, s, e, d)). The depth is computed recursively as follows:

When V(a, s, e, d) represents a thread at the outer-most lexical nesting level of parallel constructs, we set its nesting level to depth(V(a, s, e, d)) = 0. When V(a, s, e, d) represents a thread at an inner lexical nesting level of parallel constructs, we set its nesting level to depth(V(a, s, e, d)) = depth(parent(V(a, s, e, d))) + 1.

This lexical nesting-level property is not to be confused with the dynamic (runtime) nesting level of the physical threads supported by the threading runtime library. Another property of a virtual thread is its code block type (a loop, a region, a section, or a task) that can distinguish different threading semantics of a virtual thread and can guide the compiler to generate threaded code according to the definitions of the compiler-to-runtime interface. We say that a virtual thread is mapped to a physical thread (or a runtime thread) when the virtual thread is assigned a unique thread identifier a at runtime. Note that a virtual thread can be mapped to a team of physical threads for a parallel loop and region by assigning a unique thread identifier for each mapping.

Threadization Framework

Figure 2 outlines the new framework. The first two phases extract thread-level parallelism within different program scopes to construct virtual threads. The next two phases de-virtualize virtual threads progressively to match precise threading runtime constraints. The final phase lowers a virtual thread to threaded Intermediate Language (IL) by emitting calls supported by the runtime library.

Phase I: Enabling transformations and loop analysis. This phase enables loop transformations that can increase thread-level parallelism, improve data locality, and identify threadizable loops within a compilation unit (or routine). This phase is enabled with Inter-Procedural Optimization (IPO) as well. Therefore, it is not limited to a single compilation unit, but rather allows whole-program parallelism extraction.

Phase II: Virtual thread graph construction. This phase extracts thread-level parallelism captured by parallel loops and it constructs sibling/nesting relationships between virtual threads. In addition, it also collects private, firstprivate, lastprivate, and reduction variables to build the data environment d for each virtual thread.

Phase III: Devirtualization via privatization. This phase conducts transformations for all private, firstprivate, lastprivate, and reduction variables that are captured by the data environment d of virtual threads. For instance, given firstprivate(x), a local clone thr_x of global variable x is created on the stack and initialized with the value of x. All memory references to x in the thread are then substituted with thr_x.



Figure 2: The new threadization framework
click image for larger view
 

Phase IV: Devirtualization via loop partition. This phase partitions a loop using the thread identifier a based on a default schedule setting, or a scheduling type and chunk size specified with –par-schedule- <type> and –par-schedule-size options. The loop partition is represented internally with the following format:

LPARTITION (tid, sched, cs, lv, glow, gup, gstride, vlow, vup)

where tid denotes the thread identifier, sched denotes the loop scheduling type, cs denotes chunk size, lv denotes whether the code for computing last value is needed or not (FALSE means last-value is not needed), glow and gup denote the original loop lower and upper bounds, and gstride denotes the original loop stride. The parameters vlow and vup denote the loop's lower and upper bounds after loop partitioning for the virtual thread, and they are computed by an OpenMP runtime library routine to which we pass in the other parameters in LPARTITION.

Phase V: Threaded-code generation. This phase maps a virtual thread to the compiler's intermediate code such as IL statements or intrinsics, and to OpenMP runtime library calls according to the target platform. These statements and calls include (i) _fork_threads call that creates physical threads; (ii) a loop partitioning call to compute vlow, vup based on loop information captured in the LPARTITION of each virtual thread node; (iii) a T-entry and T-return pair of statements for the virtual thread based on the MET technology presented in [10].

A distinct characteristic of the new framework is that the threadization is carefully broken down into a sequence of transformations, each of which gradually transforms a virtual thread IL, without a thread identifier, to a virtual thread IL parameterized by a unique symbolic thread identifier. This process is clearly illustrated by the evolution of data properties and code re-structuring through each phase.

Loop Transformations for Threadization

Under the new framework, the compiler performs all necessary loop transformations to achieve a good data locality while preserving and enabling threadization opportunities. Consider the following loops from the subroutine Bench_StagegeredLeadfrog2 in 436.cactusADM of the SPEC* CPU2006 benchmarks.

km = 1 do j=1,ny do i=1,nx lalp(i,j,km) = alp(i,j,1) fac = -2.0d0*dt*lalp(i,j,1) ADM_gxx(i,j,1) = ADM_gxx_p(i,j,1)+fac*ADM_kxx_stag_p(i,j,1) lgxx(i,j,km) = ADM_gxx(i,j,1) ... end do end do k0 = 2 do j=1,ny do i=1,nx lalp(i,j,2) = alp(i,j,2) fac = -2.0d0*dt*lalp(i,j,2) ADM_gxx(i,j,2) = ADM_gxx_p(i,j,2)+fac*ADM_kxx_stag_p(i,j,2) lgxx(i,j,k0) = ADM_gxx(i,j,2) ... end do end do

In Phase I, the compiler analysis proves that there are no loop-carried data dependencies for the loops, and no data dependencies that prevent loop fusion. Thus, the actions taken by the compiler are to fuse the two loops first, and then to perform the steps described in the previous section. When threadization is done, the compiler emits a "FUSED LOOP WAS PARALLELIZED" diagnostic. In this example, loop fusion increases the granularity of the parallel loop, which is an effective loop transformation to reduce thread forking and mapping overhead. After threadization, the vectorization phase will operate on the virtual thread code. In this example, the compiler continues by distributing the i-loop to restrict the number of data streams per resulting loop, which favors write-buffer combining, and then it vectorizes the resulting smaller loops. The compiler emits two "PARTIAL LOOP WAS VECTORIZED" diagnostics in this case. This indicates that an effective interaction of loop transformations, threadization, and vectorization can leverage the full potential of the Intel® Core™2 Duo and Quad processor to achieve higher performance.

Cost Model for Threadization

Once a threadizable loop is identified in Phase I, Phase II forms a region within which the virtual thread will be constructed at compile time. Additionally, as the cost of thread activation and synchronization in a real system is in the range of hundreds of cycles on Intel Core 2 Duo and Quad processors, a key criterion in selecting a proper parallel loop candidate is to minimize the overhead of thread management.

A complementary goal is to ensure that the virtual thread, once invoked, runs for an adequate number of cycles in order to amortize the thread activation cost. Therefore, it is desirable to choose a loop that iterates a reasonably large number of times. The cost estimation is done via a table-driven technique based on the Intel Core 2 Duo and Quad processor instruction latency information combined with the profiling information of basic block execution counts. This algorithm is effective, especially when combined with function inlining.

Runtime Threadization Control

Statically, loops that incur a large number of instruction cycles and no loop-carried data dependencies are identified for threadization. However, selecting an appropriate loop for threadization requires that loop tripcount and number of cycles taken for each iteration are known. Often, the loop's lower and upper bounds are unknown at compile time, so the compiler can not compute the tripcount statically. In general, the static cost analysis may not provide an accurate cost estimation to guide and guard threadization in this case. To solve this issue, the new threadizer generates symbolic runtime test expressions and multi-versioned loops. Assume the symbolic tripcount expression of loop L is Etripcount(L), the estimated execution cycles of loop body of loop L is C(L). The following runtime tests are generated to control the threadization at runtime:

  • C(L) < Thresholdpar
  • Etripcount(L) × C(L) < Thresholdpar

Multi-versioning is necessary for runtime threadization control. Consider, for example, the following sequential loop in C:

unsigned i, nd[1000];
/* size and target are incoming arguments */
for (i=0; i nd[i] *= (1 << target);
}

This loop is selected as a candidate loop for threading based on loop analysis. Then, static cost analysis finds that C(L) < Thresholdpar; however, the loop's upper bound t0 (representing size) and the tripcount t4 in the IL below are unknown at compile time. Hence, a runtime test code t4 < Thresholdpar / C(L) is generated together with two-versioned loops. The pseudo-threaded code generated is sketched below.

BEGIN REGION t97 = _ok_to_fork(); if ( t97 != 0) { if (t4 < 1363) JMP L123; // runtime test u = t0; // t0 represents &quot;size&quot; p1 = t1; // t1 represents &quot;target&quot; _fork_threads(. _parloop, &u, &p1, &nd, .); } else { L123: DO i = 1, t0 nd[i-1] = nd[i-1]*(1<<t1); END DO } END REGION BEGIN REGION // threadized code T-entry _parloop( (..., tid, upper, nd, p1); low = 1; // original loop lower bound up = upper; // original loop upper bound gu = up; _loop_partition(tid, sched,... , &low, &up); t105 = low; // local loop lower bound t106 = up; // local loop upper bound t1 = p1; // p1 represents &quot;target&quot; if (t105 <= gu) { DO i = 0, t106 &#150; t105 (*nd)[i + t105] = (*nd)[i + t105] * (1 << t1); END DO } _join_threads(tid); T-return; END REGION

In this example, if t4 is less than 1363, the execution will switch to serial loop to avoid threading overhead. The runtime threadization control is a simple yet efficient way for parallelizing loops with unknown bounds at compile time. We obtained good speedup by emitting multi-versioned serial and threaded code at compile time, and using runtime tests to select the most beneficial version to execute in some applications.

  Section 4 of 12  

Back to Top

In this article

Download a PDF of this article.