Outer Loop Vectorization

ID 660014
Updated 3/5/2019
Version Latest



Vectorization Essentials, Outer Loop Vectorization


Outer loop vectorization is a technique to enhance performance. By default, compilers attempt to vectorize innermost loops in nested loop structures. But, in some cases, the number of iterations in the innermost loop is small. In this case, inner-loop vectorization is not profitable. However, if an outer loop contains more work, a combination of SIMD-enabled functions (formerly called elemental functions), strip-mining (loop blocking), and pragma/directive SIMD can force vectorization at this outer, profitable level.


Outer loop vectorization is demonstrated below.

Example loop-nest:

If you have a sparse-matrix-vector pattern as follows:

for(LocalOrdinalType row=ib*BLOCKSIZE; row<top; ++row) {

    for(LocalOrdinalType i=Arowoffsets[row]; i<Arowoffsets[row+1]; ++i) {
      local_y[row] += Acoefs[i]*local_x[Acols[i]];

Vectorization at inner-level vs. outer-level:

You can vectorize the inner i-loop. If you do this default vectorization, one gets unit-stride load for arrays "Acoefs" and "Acols", and gathers for the elements of array "local_x". The "local_y" storage accesses get moved out of the loop and a reduction-temp will be introduced by the compiler in the vectorized loop.

As an alternative, you can vectorize the outer-loop (using SIMD-enabled functions). In this case, for the inner-loop-body the compiler will generate unit-stride load/store for "local_y", "Acoefs" and "Acols" loads become gather, "local_x" continues to be gather.

Why this may be useful:

If the average inner-loop trip-counts are low (not enough to fill up a full vector) and the outer-loop trip-counts are large, then you may get better vectorization-efficiency. The other consideration is: Are there any expensive operations in the outer-loop (say a divide) that now get vectorized due to outer-loop vectorization? The potential gains have to be weighed against the costs of extra gathers/scatters compared to inner-loop vectorization. Plus, specifying alignment becomes harder for outer-loop vectorization.

How to do the outer-loop vectorization:

  1. Add the simd pragma (with the right clauses) to the outer-loop. Or use the corresponding OpenMP* variation, #pragma omp simd and its appropriate clauses.
  2. Directly vectorize at outer loop level by outlining the body of the outer-loop into a SIMD-enabled function and using the simd pragma.
  3. Examples of loop blocking (strip-mining in 2 or more dimensions) are in the Intel C++ Developer Guide.

For the example given above, the outer-loop can be vectorized using the first method simply by adding "#pragma simd" to the outer loop.

Alternatively, it can be vectorized using the second method as follows:

#pragma simd
    for(LocalOrdinalType row=ib*BLOCKSIZE; row<top; ++row) {
       Inner_loop_elem_function(local_y, row, Acoefs, local_x,
                                Acols, Arowoffsets);

__declspec(vector(uniform(Arowoffsets, Acoefs, local_x, Acols, local_y),
Inner_loop_elem_function(float *local_y, int row, float *Acoefs,
                         float *local_x, int *Acols, int *Arowoffsets)
    for(LocalOrdinalType i=Arowoffsets[row]; i<Arowoffsets[row+1]; ++i) {
        local_y[row] += Acoefs[i]*local_x[Acols[i]];

Here is another example of outer-loop vectorization via use of a SIMD-enabled function:

#pragma simd // simd pragma for outer-loop at call-site of SIMD-enabled function
for (int i = beg*16; i < end*16; ++i) {
   particleVelocity_block(px[i], py[i], pz[i], destvx + i, destvy + i,
                          destvz + i, vel_block_start, vel_block_end);

Definition of the SIMD-enabled function:

__declspec(vector(uniform(start,end), linear(velx,vely,velz)))
static void particleVelocity_block(const float posx, const float posy,
                                   const float posz, float *velx,
                                   float *vely, float *velz,
                                   int start, int end) {
    for (int j = start; j < end; ++j) {
        const float del_p_x  = posx - px[j];
        const float del_p_y  = posy - py[j];
        const float del_p_z  = posz - pz[j];
        const float dxn = del_p_x * del_p_x + del_p_y * del_p_y +
                          del_p_z * del_p_z +pa[j]* pa[j];
        const float dxctaui  = del_p_y * tz[j] - ty[j] * del_p_z;
        const float dyctaui  = del_p_z * tx[j] - tz[j] * del_p_x;
        const float dzctaui  = del_p_x * ty[j] - tx[j] * del_p_y;
        const float dst      = 1.0f/std::sqrt(dxn);
        const float dst3     = dst*dst*dst;
        *velx               -= dxctaui * dst3;
        *vely               -= dyctaui * dst3;
        *velz               -= dzctaui * dst3;

Note the use of the uniform and linear clauses on the SIMD-enabled function here. The last two function arguments vel_block_start and vel_block_end are the same for all i, so these two are marked as uniform. The pointer arguments velx, vely, and velz are marked as linear to correspond to the single call-site. 

These pointer variables are also marked as aligned (by the user taking advantage of the alignment properties for the pointers at the call-site) using appropriate clauses inside the SIMD-enabled function.

You can find more examples of outer-loop vectorization described in this paper from ISCA '12, Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? (June 2012)

Take Aways

By default, compilers will attempt to vectorize the innermost loop of a nested loop structure. If there are not enough iterations in this innermost loop, it may not be profitable for the compiler to vectorize that innermost loop. However, if an outer loop has more iterations, vectorization at that outer level may be effective. To do this, a combination of using SIMD-enabled functions and pragma/directive SIMD can move the vectorization to this outer level.


It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon® processors with the Intel® AVX architecture.

Back the main chapter Vectorization Essentials.