Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.01

  • Volume 11
  • Issue 04
  • Published November 15, 2007

  Section 6 of 12  

Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors

ENHANCED LOOP OPTIMIZATIONS

In addition to the revamped threadizer and vectorizer, the Intel® 10.1 compilers introduce a single unified framework designed primarily to provide better interaction among the loop optimizations, the threadizer, and the vectorizer. The loop optimizations target cache and memory behavior using transformations that are well known in the literature, such as linear loop transformations, distribution, fusion, blocking, unroll-jam, loop multi-versioning, and scalar replacement [7, 11, 12]. To derive the maximum possible performance for programs with effective threadization and vectorization, individual loop optimizations are enhanced and ordered so as to achieve the best memory locality while keeping the innermost loop efficiently vectorizable. Similarly, optimizations are applied to a loop nest to enable threadization of the outer loop wherever possible, thus increasing the granularity of parallelism and reducing threading overheads.

Loop Distribution Enhancements

Loop Distribution Pass-1 is invoked to generate more coarse-grained threadizable loops through statement reordering and grouping, while preserving correctness and keeping the loops perfectly nested so that further loop optimizations such as interchange remain possible.

Loop Distribution Pass-2 is invoked before vectorization. For each distributed loop, this pass groups together memory references that have the required stride, data type, and alignment. These properties ensure efficient vectorization of each such loop (where vectorization is legal), making good use of the available micro-architectural resources. The loop distribution heuristics also trade off maximal distribution for vectorization against better cache reuse in the vectorized loops. The Intel® Core™ micro-architecture features more write-combining buffers and larger data caches with higher associativity than previous generations. This enables better performance through vectorization without excessive loop distribution, thereby reducing vectorized loop overheads.
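As a rough illustration of the idea, the following C sketch shows a loop before and after distribution by hand. The function names, the statements in the loop body, and the assumption that the four arrays never overlap are all hypothetical; the compiler's actual grouping heuristics are not shown in this article.

```c
#include <assert.h>
#include <stddef.h>

/* Before distribution: float and double streams are mixed in one loop body.
 * Assumption (not from the article): the four arrays do not overlap. */
void mixed_loop(size_t n, float *restrict a, const float *restrict b,
                double *restrict c, const double *restrict d)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * 2.0f;   /* 4-byte elements, unit stride */
        c[i] = d[i] + 1.0;    /* 8-byte elements, unit stride */
    }
}

/* After distribution: each loop touches references of a single data type,
 * so each one is a simpler candidate for vectorization. */
void distributed_loops(size_t n, float *restrict a, const float *restrict b,
                       double *restrict c, const double *restrict d)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * 2.0f;
    for (size_t i = 0; i < n; i++)
        c[i] = d[i] + 1.0;
}
```

The two forms compute the same results because the two statements are independent of each other. The cost of the distributed form is a second pass of loop overhead, which is the trade-off the heuristics described above have to weigh.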

Loop Multi-versioning

Multi-versioning helps to deal with two potential roadblocks that prevent a loop from being vectorized or parallelized. The first arises when the loop contains references with possible cross-iteration data dependencies that the compiler cannot rule out. The second arises when the references' cross-iteration strides are unknown at compile time, e.g., dope-vector-based arrays in Fortran 90. In either case, the multi-versioning module generates code that checks at runtime whether the required conditions hold. It also generates different copies of the loop, each guarded by a different condition and optimized according to that condition.

For example, if a loop has references a(i) and b(i), and data dependence analysis cannot prove that a and b do not overlap, multi-versioning can help in two possible ways. If the compiler decides that vectorization is important, versioning generates a test to ensure that a(0) and b(0) are at least 16 bytes apart. If this condition holds at runtime, a vectorized version of the loop is run. Otherwise, a non-vectorized version of the loop is run; this version may still be optimized in other ways (e.g., unrolling). Both loop versions and the runtime test are generated into the executable ahead of time by the compiler. The multi-versioning module generates the different loop versions and annotates their properties with internal directives that are then used by the vectorizer.
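The vectorization case can be sketched by hand in C as follows. This is only an approximation under stated assumptions: the function names and the loop body are hypothetical, and the "vector" version is emulated with four scalar loads followed by four stores (16 bytes of float data) rather than real SIMD instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Runtime guard: are the two base addresses at least 16 bytes apart? */
int bases_16_bytes_apart(const float *a, const float *b)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t dist = pa > pb ? pa - pb : pb - pa;
    return dist >= 16;
}

/* Computes a[i] = b[i] + 1 for i in [0, n).
 * Returns 1 if the vector-style version ran, 0 if the scalar one ran. */
int add_one_versioned(size_t n, float *a, const float *b)
{
    assert(a && b);
    if (bases_16_bytes_apart(a, b)) {
        /* Version 1: process 4 floats per step, loading all four
         * inputs before storing, as a 16-byte vector operation would. */
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            float t0 = b[i], t1 = b[i + 1], t2 = b[i + 2], t3 = b[i + 3];
            a[i]     = t0 + 1.0f;
            a[i + 1] = t1 + 1.0f;
            a[i + 2] = t2 + 1.0f;
            a[i + 3] = t3 + 1.0f;
        }
        for (; i < n; i++)          /* remainder iterations */
            a[i] = b[i] + 1.0f;
        return 1;
    }
    /* Version 2: plain scalar loop, safe for any overlap. */
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 1.0f;
    return 0;
}
```

When the bases are at least one vector width apart, a group of four stores cannot touch any of the four values loaded in the same step, so the grouped version produces the same result as the scalar loop even if the arrays overlap further along.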

On the other hand, if threadization is more important, versioning will generate a test to ensure that the arrays do not overlap (using the initial addresses of a and b, and the number of loop iterations). The loop version guarded by this independence test can then be safely parallelized.
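A minimal C sketch of such an independence test is shown below, assuming arrays of double and a hypothetical loop body. The names are illustrative, and the independent version is executed serially here; the comment marks where a threadizer could distribute iterations across threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Independence test built from the base addresses and the trip count:
 * true when [a, a+n) and [b, b+n) share no element. */
int ranges_disjoint(const double *a, const double *b, size_t n)
{
    uintptr_t a_lo = (uintptr_t)a, a_hi = (uintptr_t)(a + n);
    uintptr_t b_lo = (uintptr_t)b, b_hi = (uintptr_t)(b + n);
    return a_hi <= b_lo || b_hi <= a_lo;
}

/* Computes a[i] = 2 * b[i]. Returns 1 if the independent version was
 * selected, 0 if the serial fallback was selected. */
int double_into(size_t n, double *a, const double *b)
{
    assert(a && b);
    if (ranges_disjoint(a, b, n)) {
        /* Independent version: iterations may run in any order, so this
         * loop could be split across threads. Run serially in this sketch. */
        for (size_t i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
        return 1;
    }
    /* Fallback: keep the original serial iteration order. */
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
    return 0;
}
```

Unlike a 16-byte distance check, this test has to cover the whole iteration range, because threads may execute iterations that are far apart at the same time.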

Similarly, if a loop has references to dope-vector-based arrays (e.g., assumed-shape arrays), versioning can generate checks that examine the stride values of the arrays from the dope vectors at runtime. If the strides are all one, the loop may then be efficiently vectorized (assuming the other vectorization conditions are met).
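The stride check can be mimicked in C with a small descriptor struct. The `dope_vector` type below is a hypothetical stand-in for illustration only; real Fortran dope vectors are compiler-internal and carry more fields. The loop body is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1-D array descriptor: base address, element count,
 * and stride measured in elements. */
typedef struct {
    double   *base;
    size_t    extent;
    ptrdiff_t stride;
} dope_vector;

/* Computes y(i) = y(i) + alpha * x(i).
 * Returns 1 if the unit-stride version ran, 0 if the general one ran. */
int axpy_versioned(double alpha, const dope_vector *x, const dope_vector *y)
{
    assert(x && y && x->extent == y->extent);
    size_t n = x->extent;

    if (x->stride == 1 && y->stride == 1) {
        /* Unit-stride version: contiguous accesses, a good candidate
         * for vectorization. */
        const double *xp = x->base;
        double *yp = y->base;
        for (size_t i = 0; i < n; i++)
            yp[i] += alpha * xp[i];
        return 1;
    }
    /* General version: strides are only known at runtime. */
    for (size_t i = 0; i < n; i++)
        y->base[(ptrdiff_t)i * y->stride] +=
            alpha * x->base[(ptrdiff_t)i * x->stride];
    return 0;
}
```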

The versioning module uses a heuristic to limit the number of tests and the number of loop versions, reducing the impact on executable size.

Loop Blocking and Unrolling

The loop blocking and loop unrolling phases have been improved for the Intel Core micro-architectures. Based on our experience with application code, the enabling decisions and the optimization parameters have been modified to make the best use of the new cache architecture. The phase ordering of the blocking phase has also been modified with respect to the vectorization phase to extract the maximum benefit possible from these optimizations.

The vectorizer modifies simple inner loops to create vector loops. This leads to a complex loop structure that is not amenable to blocking, and there are several cases where vectorization degrades performance when compared to loop blocking alone. Another loop-blocking phase has therefore been added before vectorization, so that blocking can make better use of the cache, and later vectorization of the innermost blocked loop can further improve parallelism across loop iterations.

The loop blocking phase has also been enhanced in our new unified framework to get the best out of the Intel Core micro-architecture. Blocks, or tiles, hold data in the cache, and their sizes serve as the stride factors of the outer block-controlling loops. The block (tile) size selection algorithms have also been improved. Our primary focus now is to improve cache locality at the L2 cache level. We try to enable more register reuse by performing unroll-jam (a.k.a. register blocking) of outer loops inside the inner blocked loops.
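To make the terminology concrete, here is a hand-blocked matrix multiply in C. It is a sketch only: the tile size of 32 is an arbitrary illustrative value, not one taken from the compiler, and the function and helper names are hypothetical. Matrices are square, row-major, and the caller is assumed to zero the output.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 32 };  /* illustrative tile size, not a compiler-chosen value */

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b for n x n row-major matrices; c must start zeroed. */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    /* Outer block-controlling loops: step by the tile size. */
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                /* Inner blocked loops: work on one tile at a time. */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        /* Innermost loop is unit-stride in b and c,
                         * which keeps it a vectorization candidate. */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Each tile of `b` is reused across all rows of the current `i` block while it is still likely to be cache-resident, which is the locality benefit that blocking aims for.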

The mechanism that controls the enabling or disabling of loop unrolling has been improved. Unrolling can increase register pressure, resulting in poor code performance due to register spills and fills. Apart from the obvious cases, it is hard to predict at compile time whether loop unrolling will help or degrade performance. Our implementation makes this decision based on various program and architectural parameters. The determination of loop unrolling factors also needs to take account of register pressure in the inner loop. Our experience shows that small unroll factors are effective in most cases.
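For reference, a loop unrolled by a small factor looks like the following C sketch. The factor of 2, the loop body, and the function name are illustrative assumptions; the compiler's actual choice depends on the parameters mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += alpha * x[i], unrolled by a factor of 2. */
void axpy_unroll2(size_t n, double alpha, const double *x, double *y)
{
    assert(x && y);
    size_t i = 0;
    /* Main loop: two iterations of the original loop per trip,
     * halving the number of branch and index updates. */
    for (; i + 2 <= n; i += 2) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
    }
    /* Remainder loop: handles the last iteration when n is odd. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

A larger factor would keep more values live at once, which is where the register pressure concern comes from.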

Loop Fusion and Interchange

Loop fusion combines adjacent conforming nested loops into a single nested loop. This optimization can improve cache locality and increase the amount of computation per loop, thus increasing the granularity of threadization and reducing overheads. Loop interchange is done in such a way as to improve threadization at the outer level while keeping the memory accesses in the innermost loop unit-strided to enable efficient vectorization.
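Both transformations can be written out by hand in C, as in the sketch below. The function names and loop bodies are hypothetical, the arrays are assumed not to overlap, and the two-dimensional arrays are stored row-major, so the column index is the unit-stride one.

```c
#include <assert.h>
#include <stddef.h>

/* Two adjacent conforming loops: y is written, then read back. */
void two_loops(size_t n, const double *x, double *y, double *z)
{
    assert(x && y && z);
    for (size_t i = 0; i < n; i++) y[i] = x[i] * 2.0;
    for (size_t i = 0; i < n; i++) z[i] = y[i] + 1.0;
}

/* Fused form: y[i] is consumed right after it is produced, and one
 * larger loop body is available for threadization. */
void fused_loop(size_t n, const double *x, double *y, double *z)
{
    assert(x && y && z);
    for (size_t i = 0; i < n; i++) {
        y[i] = x[i] * 2.0;
        z[i] = y[i] + 1.0;
    }
}

/* Column-by-column traversal: the inner loop strides by `cols`. */
void add_one_strided(size_t rows, size_t cols, const double *a, double *c)
{
    assert(a && c);
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            c[i * cols + j] = a[i * cols + j] + 1.0;
}

/* Interchanged form: rows form the outer loop (a threadization
 * candidate) and the inner loop is unit-stride. */
void add_one_interchanged(size_t rows, size_t cols, const double *a, double *c)
{
    assert(a && c);
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            c[i * cols + j] = a[i * cols + j] + 1.0;
}
```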
