Development Reference Guides

Contents

Use Automatic Vectorization

Automatic vectorization is supported on Intel® 64 architectures. The information below will guide you in setting up the auto-vectorizer.

Vectorization Speedup

Where does the vectorization speedup come from? Consider the following sample code, where a, b, and c are integer arrays:

  for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];
If vectorization is not enabled (for example, if you compile using the O1 option, or disable it with the -no-vec (Linux) or /Qvec- (Windows) option), the compiler processes the code with unused space in the SIMD registers, even though each register could hold three additional integers. If vectorization is enabled (the default when compiling with O2 or higher options), the compiler may use the additional registers to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (O2) or higher.
This option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in greater performance gains on Intel® microprocessors than on non-Intel microprocessors.
To get details about the type of loop transformations and optimizations that took place, use the [Q]opt-report-phase option by itself or along with the [Q]opt-report option.
Linux
To evaluate the performance enhancement, run vec_samples:
  1. Source an environment script such as vars.sh in the <installdir> directory, using the argument appropriate for the architecture.
  2. Navigate to the <installdir>/Samples/<locale>/C++/ directory. This application multiplies a vector by a matrix using the following loop:
    for (j = 0; j < size2; j++) {
      b[i] += a[i][j] * x[j];
    }
  3. Build and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option.
    icx -O2 -no-vec Multiply.c -o NoVectMult
    ./NoVectMult
  4. Build and run the application, this time with auto-vectorization.
    icx -O2 -qopt-report=3 -vec Multiply.c -o VectMult
    ./VectMult
Windows
To evaluate the performance enhancement, run vec_samples:
  1. From the Start menu item for your product, under Intel oneAPI <version>, select the Intel oneAPI Command Prompt for oneAPI Compilers.
  2. Navigate to the <installdir>\Samples\<locale>\C++\ directory. Unzip the sample project vec_samples.zip to a writable directory. This small application multiplies a vector by a matrix using the following loop:
    for (j = 0; j < size2; j++) {
      b[i] += a[i][j] * x[j];
    }
  3. Build and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option.
    icx /O2 /Qvec- Multiply.c /FeNoVectMult
    NoVectMult
  4. Build and run the application, this time with auto-vectorization.
    icx /O2 /Qopt-report:3 /Qvec Multiply.c /FeVectMult
    VectMult
When you compare the timing of the two runs, you may see that the vectorized version runs faster. The time for the non-vectorized version is only slightly shorter than it would be if you compiled with the O1 option.

Obstacles to Vectorization

The following issues do not always prevent vectorization, but frequently cause the compiler to decide that vectorization would not be worthwhile.
  • Non-contiguous memory access:
Four consecutive integers or floating-point values, or two consecutive doubles, may be loaded directly from memory in a single SSE instruction. But if the four integers are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient. The most common examples of non-contiguous memory access are loops with non-unit stride or with indirect addressing, shown in the examples below. The compiler rarely vectorizes these loops, unless the amount of computational work is large compared to the overhead from non-contiguous memory access.
    // arrays accessed with stride 2
    for (int i = 0; i < SIZE; i += 2)
      b[i] += a[i] * x[i];

    // inner loop accesses a with stride SIZE
    for (int j = 0; j < SIZE; j++) {
      for (int i = 0; i < SIZE; i++)
        b[i] += a[i][j] * x[j];
    }

    // indirect addressing of x using index array
    for (int i = 0; i < SIZE; i += 2)
      b[i] += a[i] * x[index[i]];
    The typical message from the vectorization report is: vectorization possible but seems inefficient, although indirect addressing may also result in the following report: existence of vector dependence.
  • Data dependencies:
    Vectorization entails changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once. Vectorization is only possible if this change of order does not change the results of the calculation.
    • The simplest case is when data elements that are written (stored to) do not appear in any other iteration of the individual loop. In this case, all the iterations of the original loop are independent of each other, and can be executed in any order, without changing the result. The loop may be safely executed using any parallel method, including vectorization.
    • When a variable is written in one iteration and read in a subsequent iteration, there is a read-after-write dependency, also known as a flow dependency, for example:
      A[0] = 0;
      for (j = 1; j < MAX; j++)
        A[j] = A[j-1] + 1;
      // this is equivalent to:
      A[1] = A[0] + 1;
      A[2] = A[1] + 1;
      A[3] = A[2] + 1;
      A[4] = A[3] + 1;

      The value of j is propagated to all A[j]. This cannot safely be vectorized: if the first two iterations are executed simultaneously by a SIMD instruction, the value of A[1] is used by the second iteration before it has been calculated by the first iteration.
    • When a variable is read in one iteration and written in a subsequent iteration, this is a write-after-read dependency, also known as an anti-dependency, for example:

      for (j = 1; j < MAX; j++)
        A[j-1] = A[j] + 1;
      // this is equivalent to:
      A[0] = A[1] + 1;
      A[1] = A[2] + 1;
      A[2] = A[3] + 1;
      A[3] = A[4] + 1;

      This write-after-read dependency is not safe for general parallel execution, since the iteration with the write may execute before the iteration with the read. For vectorization, however, no iteration with a higher value of j can complete before an iteration with a lower value of j, and so vectorization is safe (it gives the same result as non-vectorized code).
      The following example may not be safe, since vectorization might cause some elements of A to be overwritten by the first SIMD instruction before being used for the second SIMD instruction:

      for (j = 1; j < MAX; j++) {
        A[j-1] = A[j] + 1;
      }
      // this is equivalent to:
      A[0] = A[1] + 1;
      A[1] = A[2] + 1;
      A[2] = A[3] + 1;
      A[3] = A[4] + 1;
    • Read-after-read situations are not really dependencies, and do not prevent vectorization or parallel execution. If a variable is unwritten, it does not matter how often it is read.
    • Write-after-write, or output dependencies, where the same variable is written to in more than one iteration, are generally unsafe for parallel execution, including vectorization.
    • One important exception that contains all of the above types of dependency is:

      sum = 0;
      for (j = 1; j < MAX; j++)
        sum = sum + A[j] * B[j];

      Although sum is both read and written in every iteration, the compiler recognizes such reduction idioms and is able to vectorize them safely. The matrix-vector multiplication loop shown earlier (b[i] += a[i][j] * x[j]) is another example of a reduction, with a loop-invariant array element in place of a scalar.
      These types of dependencies between loop iterations are sometimes known as loop-carried dependencies.
      The above examples are of proven dependencies. The compiler cannot safely vectorize a loop if there is even a potential dependency. For example:
      for (i = 0; i < size; i++) { c[i] = a[i] * b[i]; }
      In the above example, the compiler needs to determine whether, for some iteration i, c[i] might refer to the same memory location as a[i] or b[i] for a different iteration. Such memory locations are sometimes said to be aliased. For example, if a[i] pointed to the same memory location as c[i-1], there would be a read-after-write dependency. If the compiler cannot exclude this possibility, it will not vectorize the loop unless you provide it with hints.

Help the Compiler Vectorize

Sometimes the compiler has insufficient information to decide to vectorize a loop. There are several ways to provide additional information to the compiler:
  • Pragmas:
    • #pragma ivdep:
      may be used to tell the compiler that it may safely ignore any potential data dependencies. (The compiler will not ignore proven dependencies). Use of this pragma when there are dependencies may lead to incorrect results.
      There are cases where the compiler cannot tell by a static dependency analysis that it is safe to vectorize. Consider the following loop:
      void copy(char *cp_a, char *cp_b, int n) {
        for (int i = 0; i < n; i++) {
          cp_a[i] = cp_b[i];
        }
      }
      Without more information, a vectorizing compiler must conservatively assume that the memory regions accessed by the pointer variables cp_a and cp_b may (partially) overlap, which can cause potential data dependencies that prohibit straightforward conversion of this loop into SIMD instructions. At this point, the compiler may decide to keep the loop serial or generate a runtime test for overlap, where the loop in the true branch can be converted into SIMD instructions:
      if (cp_a + n < cp_b || cp_b + n < cp_a)
        /* vector loop */
        for (int i = 0; i < n; i++)
          cp_a[i] = cp_b[i];
      else
        /* serial loop */
        for (int i = 0; i < n; i++)
          cp_a[i] = cp_b[i];
      Runtime data-dependency testing provides a way to exploit implicit parallelism in C or C++ code at the expense of a slight increase in code size and testing overhead. If the function copy is only used in specific ways, you can help the compiler:
      • If the function is mainly used for small values of n or for overlapping memory regions, you can prevent vectorization and the corresponding runtime overhead by inserting a #pragma novector hint before the loop.
      • Conversely, if the loop is guaranteed to operate on non-overlapping memory regions, you can provide this information to the compiler by means of a #pragma ivdep hint before the loop. This tells the compiler that conservatively assumed data dependencies that prevent vectorization can be ignored, and results in vectorization of the loop without runtime data-dependency testing.
        void copy(char *cp_a, char *cp_b, int n) {
          #pragma ivdep
          for (int i = 0; i < n; i++) {
            cp_a[i] = cp_b[i];
          }
        }
      You can also use the restrict keyword.
    • #pragma loop count (n):
      gives the typical trip count of the loop. This helps the compiler decide if vectorization is worthwhile, or if it should generate alternative code paths for the loop.
    • #pragma vector always:
      asks the compiler to vectorize the loop.
    • #pragma vector align:
      asserts that data within the following loop is aligned (to a 16-byte boundary, for Intel® SSE instruction sets).
    • #pragma novector:
      asks the compiler not to vectorize a particular loop.
    • #pragma vector nontemporal:
      gives a hint to the compiler that data will not be reused, and to use streaming stores that bypass cache.
  • Keywords:
    The restrict keyword is used to assert that the memory referenced by a pointer is not aliased. The keyword requires the use of the [Q]std=c99 compiler option. The example under #pragma ivdep above can also be handled using the restrict keyword.
    You may use the restrict keyword in the declarations of cp_a and cp_b, as shown below, to inform the compiler that each pointer variable provides exclusive access to a certain memory region. The restrict qualifier in the argument list lets the compiler know that there are no other aliases to the memory to which the pointers point; a restrict-qualified pointer provides the only means of accessing that memory in the scope in which it lives. Even if the code gets vectorized without the restrict keyword, the compiler must check for aliasing at runtime; with the restrict keyword, no runtime check is needed.
    void copy(char * __restrict cp_a, char * __restrict cp_b, int n) {
      for (int i = 0; i < n; i++)
        cp_a[i] = cp_b[i];
    }
    This method is best used when the exclusive-access property holds for the pointer variables in code with many loops, because it avoids annotating each of the vectorizable loops individually. Both the loop-specific #pragma ivdep hint and the pointer-variable-specific restrict hint must be used with care, because incorrect usage may change the semantics intended in the original program.
    Another example is the following loop, which may also not get vectorized because of a potential aliasing problem between the pointers a, b, and c:

    void add(float *a, float *b, float *c) {
      for (int i = 0; i < SIZE; i++) {
        c[i] += a[i] + b[i];
      }
    }
    If the restrict keyword is added to the parameters, the compiler assumes that you will not access the memory in question with any other pointer, and vectorizes the code properly:

    // let the compiler know the pointers are safe with restrict
    void add(float * __restrict a, float * __restrict b, float * __restrict c) {
      for (int i = 0; i < SIZE; i++) {
        c[i] += a[i] + b[i];
      }
    }
    The downside of using restrict is that not all compilers support this keyword, so your source code may lose portability.
  • Options/switches:
    You can use options to enable different levels of optimizations to achieve automatic vectorization:
    • Interprocedural optimization (IPO):
      Enable IPO with the [Q]ipo option to optimize across source files. IPO gives the compiler additional information about a loop, such as trip counts, alignment, or data dependencies, and may also allow inlining of function calls.
    • High-level optimizations (HLO):
      Enable HLO with the O3 option. This enables additional loop optimizations that make it easier for the compiler to vectorize the transformed loops.

Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.