Intel® Advisor User Guide

ID 766448
Date 11/07/2023
Public



Vectorization Recommendations for C++

Ineffective Peeled/Remainder Loop(s) Present

Some (or all) of the source loop iterations do not execute in the loop body. Improve performance by moving source loop iterations from the peeled/remainder loops into the loop body.

Align Data

One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.

Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:

float *array;
array = (float *)_mm_malloc(ARRAY_SIZE*sizeof(float), 64);
// Somewhere else
__assume_aligned(array, 64);
// Use array in loop
_mm_free(array);

Align static data using a 64-byte boundary:

__declspec(align(64)) float array[ARRAY_SIZE];
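Compiler specifics aside, the dynamic-data approach can be sketched portably. This is an illustrative sketch only: std::aligned_alloc (C++17) stands in for _mm_malloc, and a runtime check stands in for __assume_aligned, which is an Intel compiler extension.

```cpp
#include <cstdint>
#include <cstdlib>

// Portable sketch of 64-byte-aligned allocation. std::aligned_alloc
// requires the total size to be a multiple of the alignment, so the
// byte count is rounded up to a multiple of 64.
float *alloc_aligned_floats(std::size_t count)
{
    std::size_t bytes = ((count * sizeof(float) + 63) / 64) * 64;
    return static_cast<float *>(std::aligned_alloc(64, bytes));
}

// True when p sits on a 64-byte boundary; this check mirrors the
// guarantee that __assume_aligned(array, 64) asserts to the compiler.
bool is_aligned_64(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Memory obtained this way is released with std::free rather than _mm_free.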


Parallelize The Loop with Both Threads and SIMD Instructions

The loop is threaded and auto-vectorized; however, the trip count is not a multiple of vector length. To fix: Do all of the following:

  • Use the #pragma omp parallel for simd directive to parallelize the loop with both threads and SIMD instructions. Specifically, this directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions.
  • Add the schedule(simd: [kind]) modifier to the directive to guarantee the chunk size (number of iterations per chunk) is a multiple of vector length.

Original code sample:

void f(int a[], int b[], int c[])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}

Revised code sample:

void f(int a[], int b[], int c[])
{
    #pragma omp parallel for simd schedule(simd:static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}


Force Scalar Remainder Generation

The compiler generated a masked vectorized remainder loop that contains too few iterations for efficient vector processing. A scalar loop may be more beneficial. To fix: Force scalar remainder generation using a directive: #pragma vector novecremainder.

void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    // Force the compiler to not vectorize the remainder loop
    #pragma vector novecremainder
    for (i=0; i<n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}


Force Vectorized Remainder

The compiler did not vectorize the remainder loop, even though doing so could improve performance. To fix: Force vectorization using a directive: #pragma vector vecremainder.

void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    // Force the compiler to vectorize the remainder loop
    #pragma vector vecremainder
    for (i=0; i<n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}


Specify The Expected Loop Trip Count

The compiler cannot detect the trip count statically. To fix: Specify the expected number of iterations using a directive: #pragma loop_count.

#include <stdio.h>

int mysum(int start, int end, int a)
{
    int iret=0;
    // Iterate through a loop a minimum of three, maximum of ten, and average of five times
    #pragma loop_count min(3), max(10), avg(5)
    for (int i=start;i<=end;i++)
        iret += a;
    return iret;
}

int main()
{
    int t;
    t = mysum(1, 10, 3);
    printf("t1=%d\r\n",t);
    t = mysum(2, 6, 2);
    printf("t2=%d\r\n",t);
    t = mysum(5, 12, 1);
    printf("t3=%d\r\n",t);
}


Change The Chunk Size

The loop is threaded and vectorized using the #pragma omp parallel for simd directive, which parallelizes the loop with both threads and SIMD instructions. Specifically, the directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions. In this case, the chunk size (number of iterations per chunk) is not a multiple of vector length. To fix: Add a schedule(simd: [kind]) modifier to the #pragma omp parallel for simd directive.

void f(int a[], int b[], int c[])
{
    // Guarantee a multiple of vector length.
    #pragma omp parallel for simd schedule(simd: static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}


Add Data Padding

The trip count is not a multiple of vector length. To fix: Do one of the following:

  • Increase the size of objects and add iterations so the trip count is a multiple of vector length.
  • Increase the size of static and automatic objects, and use a compiler option to add data padding.
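The first approach can be sketched as follows. The logical trip count of 1003 and the vector length of 8 floats are illustrative assumptions, not values taken from a report:

```cpp
// Assumed values for illustration: a logical trip count of 1003 and a
// vector length of 8 floats.
const int N = 1003;
const int VLEN = 8;
// Round the allocation and the trip count up to the next multiple of
// VLEN so the vectorized loop needs no remainder.
const int N_PADDED = ((N + VLEN - 1) / VLEN) * VLEN; // 126 * 8 = 1008

static float a[N_PADDED], b[N_PADDED], c[N_PADDED];

void add_padded()
{
    // The extra N_PADDED - N iterations write harmless values into the
    // padding elements, which are never read as real data.
    for (int i = 0; i < N_PADDED; i++)
    {
        a[i] = b[i] + c[i];
    }
}
```

The trade-off is a small amount of wasted storage in exchange for eliminating the remainder loop entirely.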


Collect Trip Counts Data

The Survey Report lacks trip counts data, which could be used to generate more precise recommendations.

Disable Unrolling

The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling with the #pragma nounroll directive, or decrease the unroll factor with #pragma unroll(n).

void nounroll(int a[], int b[], int c[], int d[])
{
    // Disable automatic loop unrolling
    #pragma nounroll
    for (int i = 1; i < 100; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}


Use A Smaller Vector Length

The compiler chose a vector length that might be larger than the trip count. To fix: Specify a smaller vector length using a directive: #pragma omp simd simdlen(n).

void f(int a[], int b[], int c[], int d[])
{
    // Specify a vector length of 4
    #pragma omp simd simdlen(4)
    for (int i = 1; i < 100; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}

In Intel Compiler version 19.0 and higher, the vectorlength clause accepts a list of lengths, letting the compiler choose the best one based on cost: #pragma vector vectorlength(vl1, vl2, ..., vln), where each vl is an integer power of 2.

void f(int a[], int b[], int c[], int d[])
{
    // Specify list of vector lengths
    #pragma vector vectorlength(2, 4, 16)
    for (int i = 1; i < 100; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}


Disable Dynamic Alignment

The compiler automatically peeled iterations from the vector loop into a scalar loop to align the vector loop with a particular memory reference; however, this optimization may not be ideal. To possibly achieve better performance, disable automatic peel generation using the directive: #pragma vector nodynamic_align.

void f(float * a, float * b, float * c, int len)
{
    #pragma vector nodynamic_align
    for (int i = 0; i < len; i++)
    {
        a[i] = b[i] * c[i];
    }
}


Serialized User Function Call(s) Present

User-defined functions in the loop body are not vectorized.

Enable Inline Expansion

Inlining of user-defined functions is disabled by a compiler option. To fix: When using the Ob (Windows* OS) or inline-level (Linux* OS) compiler option to control inline expansion, replace the 0 argument with 1 to enable inlining when an inline keyword or attribute is specified, or with 2 to enable inlining of any function at the compiler's discretion.

Windows* OS: /Ob1 or /Ob2
Linux* OS: -inline-level=1 or -inline-level=2
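As a sketch of the first case, a function marked with the inline keyword becomes an inlining candidate under /Ob1 or -inline-level=1, which lets the containing loop vectorize. The names scale and apply_scale are illustrative, not taken from Intel Advisor output:

```cpp
// Hypothetical helper for illustration. With /Ob1 (Windows* OS) or
// -inline-level=1 (Linux* OS), functions marked inline become
// candidates for inline expansion.
inline float scale(float x)
{
    return 2.0f * x + 1.0f;
}

void apply_scale(float a[], int n)
{
    for (int i = 0; i < n; i++)
    {
        // Once the call is expanded inline, the loop body contains no
        // serialized function call and can be vectorized.
        a[i] = scale(a[i]);
    }
}
```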
