Multi-Core Software
Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors
REVAMPING THE VECTORIZER
The new vectorizer is designed to be tightly integrated with our existing enhanced high-level loop transformation framework. The strengths of the new vectorizer include the following:
- A new Abstract Vector Representation (AVR) is designed to bridge the semantic gap between the high-level representation and low-level instructions.
- Better interaction with the new FP-model and other loop optimizations produces better performance.
- The new vectorizer is moved downstream to use SSA and leverage global constant propagation and Common Sub-expression Elimination (CSE).
- Table-driven type selection and code generation with a well-tuned cost model simplify maintenance and future extensibility.
Essentially, the vectorizer converts sequential code into a vector form that exploits all Streaming SIMD Extensions. Consider, for example, the following sequential loop in C:
When compiled for a target architecture that supports SSE2, the compiler generates a vectorized loop with the following assembly code:
Here, the compiler first recognizes a vector loop with idiomatic saturation arithmetic and proper alignment of all access patterns and subsequently converts the code into appropriate SIMD instructions with vector length 16. Due to the removal of a conditional branch relative to a sequential implementation of the loop, in this particular case, vectorization typically exhibits a speedup that exceeds the vector length.
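As a minimal sketch of the kind of loop described above, assuming unsigned 8-bit arrays and a clamp at 255, the scalar idiom and a hand-written SSE2 equivalent might look as follows. The function names, the second array `b`, and the length parameter `n` are assumptions made for illustration, not the article's original listing, and the intrinsics version uses unaligned loads to stay safe for any input, whereas the compiler described here would use aligned moves once alignment is proven.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar form: the 32-bit temporary t plus the conditional clamp is the
   idiom a vectorizer can recognize as an unsigned saturating add. */
void sat_add_scalar(uint8_t *a, const uint8_t *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++) {
        int t = a[i] + b[i];
        a[i] = (uint8_t)(t > 255 ? 255 : t);
    }
}

/* Hand-written equivalent: 16 elements per iteration and no branch. */
void sat_add_simd(uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* _mm_adds_epu8 maps to paddusb: unsigned saturating byte add. */
        _mm_storeu_si128((__m128i *)(a + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Remaining elements (or all of them when SSE2 is unavailable). */
    sat_add_scalar(a + i, b + i, n - i);
}
```

Both functions should produce identical results for any input; the SIMD version simply replaces the compare-and-select on each element with one saturating instruction per 16 elements.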
Vectorization for Streaming SIMD Extensions strongly resembles vectorization for traditional vector architectures [1, 11], such as pipelined vector processors. There are also a few important differences [2], briefly described below:
- A relatively short and fixed vector length requires a sequential "cleanup" loop to deal with the remaining iterations, but it also makes the vector instructions more suitable for fine-grained parallelism, as was first advocated in [2]. The shorter vector length can also be exploited during data dependence analysis.
- A strong sensitivity to natural alignment (typically 16-byte) requires elaborate compiler support to select, detect, or enforce a proper alignment on memory references.
- An idiomatic instruction set requires advanced idiom recognition in the compiler, such as detecting the saturation addition in the example above.
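The first difference, a fixed short vector length with a sequential cleanup loop, can be sketched in plain C. This is an illustrative model only: the constant `VL`, the function name `add_bytes`, and the use of an inner fixed-trip-count loop to stand in for one SIMD instruction sequence are assumptions, not the compiler's actual output.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 16 };  /* assumed vector length: 16 one-byte elements in 128 bits */

/* Adds b into a element-wise (with 8-bit wraparound). */
void add_bytes(unsigned char *a, const unsigned char *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    size_t n_vec = n - (n % VL);  /* iterations covered by whole vectors */
    size_t i;

    /* "Vector" loop: each outer iteration models one SIMD operation
       on VL independent lanes. */
    for (i = 0; i < n_vec; i += VL) {
        for (size_t j = 0; j < VL; j++)
            a[i + j] = (unsigned char)(a[i + j] + b[i + j]);
    }

    /* Sequential cleanup loop: at most VL - 1 remaining iterations. */
    for (; i < n; i++)
        a[i] = (unsigned char)(a[i] + b[i]);
}
```

The short vector length also matters for dependence analysis: a loop-carried dependence whose distance is at least `VL` iterations does not cross lanes within a single vector operation, so it need not prevent vectorization.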
Since vector lengths increase for narrower data types, compiler analysis is required to choose the narrowest possible data type that preserves the original meaning, such as recognizing that all 32-bit operations on variable t can be done in 8-bit precision. An in-depth description of vectorization technology in the Intel® C++/Fortran compiler is given in Reference [2]. For the remainder of this section we focus on a few specifics for the Intel® Core™2 Duo and Quad processors.
Alignment Optimization
In the Intel® Core™ micro-architecture, SIMD performance is still rather sensitive to natural alignment. Therefore, an important aspect of effective vectorization in the compiler is to select, detect, and enforce a favorable alignment on memory references. For instance, the vector loop in the previous section may only use the efficient movdqa to load 16 bytes of data after the compiler has proven that both the initial alignment (alignment on entry of the loop) and the sustained alignment (alignment preserved during execution of the loop; see footnote 1) of the memory reference a[i] are 16-byte. The less efficient movdqu would have to be used if the memory reference had an unknown alignment or was misaligned, because using aligned data movement instructions on unaligned memory locations yields a program fault. The Intel compiler uses a continuously growing assortment of alignment optimizations, including data layout optimization, inter- and intra-procedural alignment propagation, and loop transformations such as static and dynamic loop peeling and multi-versioning [1].
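Dynamic loop peeling, one of the transformations mentioned above, can be sketched by hand as follows. This is a simplified illustration under stated assumptions: the function names `peel_count` and `sat_add_const`, the constant operand `k`, and the 8-bit element type are hypothetical, and a production compiler would combine peeling with the other techniques listed rather than use it in isolation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Number of leading 1-byte elements to peel so that the next element
   starts on a 16-byte boundary. */
size_t peel_count(const void *p) {
    return (size_t)((16 - ((uintptr_t)p & 15)) & 15);
}

static uint8_t sat_add_u8(uint8_t x, uint8_t k) {
    int t = x + k;
    return (uint8_t)(t > 255 ? 255 : t);
}

/* a[i] = saturate(a[i] + k) for i in [0, n). */
void sat_add_const(uint8_t *a, uint8_t k, size_t n) {
    size_t i = 0;
    size_t peel = peel_count(a);
    if (peel > n) peel = n;

    /* Peeled scalar iterations establish the initial alignment. */
    for (; i < peel; i++) a[i] = sat_add_u8(a[i], k);

#if defined(__SSE2__)
    const __m128i vk = _mm_set1_epi8((char)k);
    /* Unit stride with 16 bytes per step sustains the alignment. */
    for (; i + 16 <= n; i += 16) {
        assert(((uintptr_t)(a + i) & 15) == 0);
        __m128i va = _mm_load_si128((const __m128i *)(a + i)); /* movdqa */
        _mm_store_si128((__m128i *)(a + i), _mm_adds_epu8(va, vk));
    }
#endif
    /* Scalar cleanup for the remaining iterations. */
    for (; i < n; i++) a[i] = sat_add_u8(a[i], k);
}
```

The peel count is computed at run time here because the alignment of `a` is unknown at compile time; with static peeling the same count would be a compile-time constant derived from alignment propagation.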
Alignment propagation resembles classical constant propagation, but uses a more elaborate lattice of alignment values <2^n, o>, where o denotes a non-negative offset relative to a base 2^n, together with corresponding jump functions. Propagating bases combined with offsets, a method described in [2], yields more accurate information than propagating bases alone and ultimately offers more opportunities for optimization, such as peeling off unfavorable alignments or using specific instruction sequences for a data movement that splits a cache line. The information is associated with all variables, not just pointers, which has been shown empirically to improve the accuracy of the computed results. A variety of alignment-related optimizations can be found in [1, 3, 6, 8].
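To make the idea of a base-plus-offset lattice concrete, the sketch below models an alignment value <2^n, o> as "address is congruent to o modulo 2^n". The type `align_t` and the functions `align_add`, `align_meet`, and `static_peel` are hypothetical names chosen for illustration; they follow the general construction, not the Intel compiler's internal data structures or jump functions.

```c
#include <assert.h>
#include <stdint.h>

/* Alignment value <base, offset>: address == offset (mod base),
   with base a power of two and 0 <= offset < base.
   base == 1 means nothing is known. */
typedef struct { uint32_t base; uint32_t offset; } align_t;

/* Transfer function for "p + c": the base is kept, the offset shifts. */
align_t align_add(align_t a, int64_t c) {
    assert(a.base != 0 && (a.base & (a.base - 1)) == 0);
    uint32_t cm = (uint32_t)((uint64_t)c & (a.base - 1)); /* c mod base */
    align_t r = { a.base, (a.offset + cm) & (a.base - 1) };
    return r;
}

/* Meet at a control-flow merge: the largest power-of-two base for
   which both incoming offsets agree. */
align_t align_meet(align_t x, align_t y) {
    uint32_t m = x.base < y.base ? x.base : y.base;
    while (m > 1 && ((x.offset ^ y.offset) & (m - 1)) != 0)
        m >>= 1;
    align_t r = { m, x.offset & (m - 1) };
    return r;
}

/* Bytes to peel to reach a `want`-byte boundary (want a power of two),
   or -1 when the lattice value is too weak to decide statically. */
int static_peel(align_t a, uint32_t want) {
    if (a.base < want) return -1;
    return (int)((want - (a.offset & (want - 1))) & (want - 1));
}
```

For example, meeting <16, 4> with <16, 12> gives <8, 4>: tracking bases alone would only report 4-byte alignment, while the offset preserves the fact that the address is 4 bytes past an 8-byte boundary.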
Vectorizer Support for SSSE3
The Supplemental Streaming SIMD Extensions 3 (SSSE3) [4] extend previous generations of SIMD extensions with sixteen new instructions that can operate on 128-bit operands or on the old-style 64-bit operands of the MMX™ technology. The new instructions most commonly used by automatic vectorization are listed in Table 1.

Table 1: SSSE3 instructions used for auto-vectorization
The palignr instruction is used to optimize multiple unaligned loads with a statically known offset into aligned loads that are subsequently rearranged into the appropriate vector format. The idiomatic psign instruction is recognized in programming constructs that negate data elements based on the sign of other data elements. The packed absolute value instruction pabs provides a more compact and efficient way of vectorizing this operation than previously-used emulation sequences. Consider, as an example, the following loop that computes the absolute value of all elements in an array of type char.
The generated assembly code for plain SSE2 as well as SSSE3 is illustrated below. In this case, SSE2 shows a ~20x speedup, while SSSE3 shows a ~30x speedup.
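As an illustration in C intrinsics rather than the compiler's actual assembly output, the sketch below contrasts a single SSSE3 `pabsb` with one possible SSE2 emulation. The function names are assumptions, and the SSE2 sequence shown (negate, then take the unsigned minimum) is one known idiom that is not necessarily the sequence the Intel compiler emitted. The SSSE3 path is compiled only when the compiler defines `__SSSE3__`.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 */
#endif
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* SSSE3 */
#endif

/* Scalar reference: in-place absolute value of signed 8-bit elements. */
void abs_scalar(signed char *a, size_t n) {
    assert(n == 0 || a != NULL);
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0) a[i] = (signed char)-a[i];
}

/* 16 elements per iteration. */
void abs_simd(signed char *a, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
#if defined(__SSSE3__)
        v = _mm_abs_epi8(v);                                /* pabsb */
#else
        /* SSE2 emulation: |x| == min_unsigned(x, 0 - x) for 8-bit x. */
        __m128i neg = _mm_sub_epi8(_mm_setzero_si128(), v); /* psubb */
        v = _mm_min_epu8(v, neg);                           /* pminub */
#endif
        _mm_storeu_si128((__m128i *)(a + i), v);
    }
#endif
    abs_scalar(a + i, n - i);  /* cleanup, or everything without SSE2 */
}
```

The emulation needs an extra register and two instructions per vector where SSSE3 needs one, which is the kind of compaction the packed absolute value instruction provides.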
Similarly, the phadd instruction provides a more compact way of summing up partial results after vectorizing sum reductions [1, 2]. However, the current micro-architectural implementation does not provide any latency reduction over the more elaborate instruction sequences used formerly. Finally, the pshufb instruction provides an efficient way to perform a wide variety of data rearrangements, as illustrated with the following loop that operates on two arrays of type char.
This conversion between a little-endian and big-endian representation of 32-bit data elements (4 bytes) can be vectorized effectively as follows.
Here, register xmm0 is pre-loaded with the appropriate 4x4 reshuffling pattern. In fact, any reshuffling of 4 consecutive bytes, even one that repeats bytes, can be implemented similarly. The instruction is also used in a peephole-like optimization of various data rearranging sequences generated by the vectorizer.
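A hedged sketch of this endianness conversion in C is given below. The function names, the separate source and destination arrays, and the byte-count parameter `n` are assumptions for illustration; the shuffle path is compiled only when the compiler defines `__SSSE3__`, and otherwise the scalar loop does all the work.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#endif

/* Scalar reference: reverse the bytes of each 4-byte group.
   n is a byte count and must be a multiple of 4; dst and src must
   not overlap. */
void bswap32_scalar(unsigned char *dst, const unsigned char *src, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        dst[i]     = src[i + 3];
        dst[i + 1] = src[i + 2];
        dst[i + 2] = src[i + 1];
        dst[i + 3] = src[i];
    }
}

/* Four 32-bit elements (16 bytes) per pshufb. */
void bswap32_simd(unsigned char *dst, const unsigned char *src, size_t n) {
    assert(n % 4 == 0);
    size_t i = 0;
#if defined(__SSSE3__)
    /* 4x4 reshuffling pattern; _mm_set_epi8 lists bytes from the
       highest (byte 15) down to the lowest (byte 0), so result byte 0
       takes source byte 3, result byte 1 takes source byte 2, etc. */
    const __m128i pattern = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
                                         4, 5, 6, 7, 0, 1, 2, 3);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_shuffle_epi8(v, pattern));  /* pshufb */
    }
#endif
    bswap32_scalar(dst + i, src + i, n - i);  /* remaining groups */
}
```

Changing only the constant `pattern` yields any other rearrangement of bytes within each group, including patterns that repeat a source byte.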
Footnote 1: A vector loop using SSE always sustains an initial 16-byte alignment for unit stride memory references. For a scalar loop, the sustained alignment depends on the data width of these memory references.
