Vectorization Programming Guidelines
Guidelines to Vectorize Innermost Loops
Use:
- straight-line code (a single basic block)
- vector data only; that is, arrays and invariant expressions on the right-hand side of assignments. Array references can appear on the left-hand side of assignments.
- only assignment statements
Avoid:
- function calls (other than math library calls)
- non-vectorizable operations (either because the loop cannot be vectorized, or because an operation is emulated through a number of instructions)
- mixing vectorizable types in the same loop (leads to lower resource utilization)
- data-dependent loop exit conditions (leads to loss of vectorization)
When you change a loop to make it vectorizable, make only the changes that vectorization needs, and avoid these common ones:
- loop unrolling, which the compiler performs automatically
- decomposing one loop with several statements in the body into several single-statement loops
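As a rough illustration of the innermost-loop guidelines, the sketch below contrasts a loop body that is a single basic block of assignments with a loop whose exit depends on the data. The function names `scale_add` and `copy_until_negative` are hypothetical and chosen only for this example; whether a given compiler actually vectorizes either loop depends on the target and the options used.

```c
#include <stddef.h>

/* Hypothetical example. The body is one basic block with a single
   assignment; the right-hand side uses only array elements and the
   loop-invariant value a, and the trip count n is fixed on entry. */
void scale_add(size_t n, float a, const float *x, const float *y, float *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a * x[i] + y[i];
    }
}

/* Hypothetical counter-example. The loop may exit early depending on
   the values in src, so the trip count is not known before the loop
   starts. Returns the number of elements copied. */
size_t copy_until_negative(size_t n, const float *src, float *dst)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (src[i] < 0.0f) {
            break; /* data-dependent exit */
        }
        dst[i] = src[i];
    }
    return i;
}
```

The first function is the kind of loop the guidelines aim for; the second shows the data-dependent exit that the guidelines recommend avoiding.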
The compiler is limited by restrictions imposed by the underlying hardware. In the case of Intel® Streaming SIMD Extensions (Intel® SSE), the vector memory operations are limited to stride-1 accesses, with a preference for 16-byte-aligned memory references. This means that even if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it for a particular target architecture.
Style of source code
The style in which you write source code can inhibit vectorization. For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering transformations.
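The aliasing problem can be sketched with two versions of the same loop. This example assumes a C99 compiler and uses the standard `restrict` qualifier, which the text above does not mention, as one way to tell the compiler that two pointers refer to distinct locations. The function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical example. Without further information the compiler must
   assume dst and src may overlap, so a store to dst[i] could change a
   later src[j]. That limits how freely it can reorder the accesses. */
void add_may_alias(size_t n, float *dst, const float *src)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = dst[i] + src[i];
    }
}

/* Same loop, but restrict is the caller's promise that the two arrays
   do not overlap. Calling this with overlapping arrays is undefined
   behavior, so the promise must actually hold. */
void add_no_alias(size_t n, float *restrict dst, const float *restrict src)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = dst[i] + src[i];
    }
}
```

Both functions compute the same result for non-overlapping inputs; the difference is only in what the compiler is allowed to assume.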
Guidelines for Writing Vectorizable Code
- Use simple for loops. Avoid complex loop termination conditions: the upper iteration limit must be invariant within the loop. For the innermost loop in a nest of loops, you could set the upper iteration limit to be a function of the outer loop indices.
- Write straight-line code. Avoid branches such as switch, goto, or return statements; most function calls; or if constructs that cannot be treated as masked assignments.
- Avoid dependencies between loop iterations or, at the least, avoid read-after-write dependencies.
- Try to use array notation instead of pointers. C programs in particular impose very few restrictions on the use of pointers; aliased pointers may lead to unexpected dependencies. Without help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.
- Wherever possible, use the loop index directly in array subscripts instead of incrementing a separate counter for use as an array address.
- Access memory efficiently:
  - Favor inner loops with unit stride.
  - Minimize indirect addressing.
  - Align your data to 16-byte boundaries (for Intel® SSE instructions).
- Choose a suitable data layout with care. Most multimedia extension instruction sets are rather sensitive to alignment. The data movement instructions of Intel® SSE, for example, operate much more efficiently on data that is aligned at a 16-byte boundary in memory. Therefore, the success of a vectorizing compiler also depends on its ability to select an appropriate data layout which, in combination with code restructuring (like loop peeling), results in aligned memory accesses throughout the program.
- Use aligned data structures: Data structure alignment is the adjustment of any data object in relation to other objects. You can use the declaration __declspec(align). Use this hint with care. Incorrect usage of aligned data movements results in an exception when using Intel® SSE.
- Use structure of arrays (SoA) instead of array of structures (AoS): An array is the most common type of data structure that contains a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can be a hindrance to vector processing. To make vectorization of the resulting code more effective, you can also select appropriate data structures.
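Two of the guidelines in this list, unit-stride inner loops and using the loop index directly in subscripts, can be sketched as follows. The array sizes `ROWS` and `COLS` and all function names are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

#define ROWS 4
#define COLS 8

/* Unit stride: C stores arrays in row-major order, so letting the inner
   loop run over the last index visits consecutive memory locations. */
void scale_rows(float m[ROWS][COLS], float s)
{
    for (size_t i = 0; i < ROWS; i++) {
        for (size_t j = 0; j < COLS; j++) {
            m[i][j] *= s;
        }
    }
}

/* Same result, but the inner loop steps through memory COLS elements
   at a time (non-unit stride). */
void scale_cols(float m[ROWS][COLS], float s)
{
    for (size_t j = 0; j < COLS; j++) {
        for (size_t i = 0; i < ROWS; i++) {
            m[i][j] *= s;
        }
    }
}

/* Loop index used directly in the array subscripts. */
void copy_indexed(size_t n, const float *src, float *dst)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Same copy, but addressed through a separately incremented counter,
   which the compiler first has to relate back to the loop index. */
void copy_counter(size_t n, const float *src, float *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        dst[k] = src[k];
        k++;
    }
}
```

In each pair the two functions produce identical results; the first form of each pair is the one the guidelines favor.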
Dynamic Alignment Optimizations
Use Aligned Data Structures
Example: Alignment Enforcement
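The following is a minimal sketch of enforcing 16-byte alignment, not the original example for this section. The guidelines name the compiler-specific `__declspec(align)` declaration; this sketch instead assumes a C11 compiler and uses the standard `alignas` specifier and `aligned_alloc` function. The names `table`, `is_aligned_16`, and `alloc_aligned_floats` are hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

/* Statically allocated array whose first element is placed on a
   16-byte boundary (C11 alignas, assumed here in place of the
   compiler-specific __declspec(align) named in the text). */
static alignas(16) float table[256];

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Allocates room for count floats on a 16-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the
   alignment, so the byte count is rounded up. Release with free(). */
float *alloc_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 15) / 16 * 16;
    return aligned_alloc(16, rounded);
}
```

A caller might check `is_aligned_16(table)` or `is_aligned_16(ptr)` before relying on aligned data movement, since using aligned instructions on misaligned data raises an exception with Intel® SSE.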
Use Structure of Arrays versus Array of Structures
Point Structure with Data in AoS Arrangement
Points Structure with Data in SoA Arrangement
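The two arrangements can be sketched as below. This is an illustrative reconstruction under assumed names (`point_aos`, `points_soa`, `N_POINTS`, and both functions are hypothetical), not the original listing for these headings.

```c
#include <stddef.h>

#define N_POINTS 1024

/* AoS: one structure per point; the x, y, z of a point are adjacent,
   so consecutive x values are three floats apart in memory. */
struct point_aos {
    float x;
    float y;
    float z;
};

/* SoA: one array per coordinate; consecutive x values are adjacent. */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Shifts every x coordinate; memory is accessed with a stride of
   sizeof(struct point_aos). */
void shift_x_aos(struct point_aos *pts, size_t n, float dx)
{
    for (size_t i = 0; i < n; i++) {
        pts[i].x += dx;
    }
}

/* Same operation on the SoA layout; the accesses are unit stride.
   n must not exceed N_POINTS. */
void shift_x_soa(struct points_soa *pts, size_t n, float dx)
{
    for (size_t i = 0; i < n; i++) {
        pts->x[i] += dx;
    }
}
```

Both functions perform the same update; only the memory access pattern of the loop differs, which is the property the SoA guideline is concerned with.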
- Make your data structures vector-friendly.
- Make sure that inner loop indices correspond to the last (rightmost) array index in your data, which is the index that varies fastest in memory under row-major order.
- Use structure of arrays over array of structures.
- Use the smallest data types that give the needed precision to maximize potential SIMD width. (If only 16 bits are needed, using a short rather than an int can make the difference between eight-way and four-way SIMD parallelism, respectively.)
- Avoid mixing data types to minimize type conversions.
- Avoid operations not supported in SIMD hardware.
- Use all the instruction sets available for your processor. Use the appropriate command line option for your processor type, or select the appropriate IDE option (Windows* only):
- , if your application runs only on Intel® processors.
- , if your application runs on compatible, non-Intel processors.
- Vectorizing compilers usually have some built-in efficiency heuristics to decide whether vectorization is likely to improve performance. The Intel® oneAPI DPC++/C++ Compiler disables vectorization of loops with many unaligned or non-unit stride data access patterns. If experimentation reveals that vectorization improves performance, you can override this behavior using the #pragma vector always hint before the loop; the compiler then vectorizes the loop regardless of the outcome of the efficiency analysis (provided, of course, that vectorization is safe).
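A minimal sketch of where the `#pragma vector always` hint is placed is shown below. The function `gather_scale` and its parameters are hypothetical. The loop reads `src` with a non-unit stride, which is the kind of access pattern the efficiency heuristics may reject. Compilers that do not recognize this pragma are expected to ignore it, possibly with a warning, so the code still computes the same result.

```c
#include <stddef.h>

/* Hypothetical example: dst[i] = a * src[i * stride].
   The restrict qualifiers are the caller's promise that src and dst
   do not overlap, which is part of what makes vectorization safe. */
void gather_scale(size_t n, size_t stride, float a,
                  const float *restrict src, float *restrict dst)
{
    /* Ask the compiler to skip its efficiency analysis for this loop. */
#pragma vector always
    for (size_t i = 0; i < n; i++) {
        dst[i] = a * src[i * stride];
    }
}
```

Whether the hint pays off is something to measure for the specific loop and target; it overrides only the profitability decision, not the safety checks.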