Intel® Advisor User Guide

ID 766448
Date 12/16/2022
Public

Vectorization Recommendations for Fortran

Ineffective Peeled/Remainder Loop(s) Present

All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.

Align Data

One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned. To align the data, use the !DIR$ ATTRIBUTES ALIGN directive. To tell the compiler the data is aligned, use the !DIR$ ASSUME_ALIGNED directive before the source loop.
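
For example, a minimal sketch (the 64-byte boundary and the array name are illustrative):

real :: a(1000)
!DIR$ ATTRIBUTES ALIGN : 64 :: a   ! place a on a 64-byte boundary
...
!DIR$ ASSUME_ALIGNED a: 64         ! promise the compiler a is 64-byte aligned
do i = 1, 1000
    a(i) = a(i) + 1.0
enddo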

Parallelize The Loop with Both Threads and SIMD Instructions

The loop is threaded and auto-vectorized; however, the trip count is not a multiple of vector length. To fix: Do all of the following:

  • Use the !$omp parallel do simd directive to parallelize the loop with both threads and SIMD instructions. Specifically, this directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions.
  • Add the schedule(simd: [kind]) modifier to the directive to guarantee the chunk size (number of iterations per chunk) is a multiple of vector length.

Original code sample:

!$omp parallel do schedule(static)
do i = 1,1000
    c(i) = a(i)*b(i)
end do
!$omp end parallel do

Revised code sample:

!$omp parallel do simd schedule(simd: static)
do i = 1,1000
    c(i) = a(i)*b(i)
end do
!$omp end parallel do simd

Force Scalar Remainder Generation

The compiler generated a masked vectorized remainder loop that contains too few iterations for efficient vector processing. A scalar loop may be more beneficial. To fix: Force scalar remainder generation using a directive: !DIR$ VECTOR NOVECREMAINDER.

subroutine add(A, N, X)
    integer N, X
    real    A(N)
    ! Force the compiler to not vectorize the remainder loop
    !DIR$ VECTOR NOVECREMAINDER
    do i=x+1, n
        a(i) = a(i) + a(i-x)
    enddo
end

Force Vectorized Remainder

The compiler did not vectorize the remainder loop, even though doing so could improve performance. To fix: Force vectorization using a directive: !DIR$ VECTOR VECREMAINDER.

subroutine add(A, N, X)
    integer N, X
    real    A(N)
    ! Force the compiler to vectorize the remainder
    !DIR$ VECTOR VECREMAINDER
    do i=x+1, n
        a(i) = a(i) + a(i-x)
    enddo
end

Specify The Expected Loop Trip Count

The compiler cannot detect the trip count statically. To fix: Specify the expected number of iterations using a directive: !DIR$ LOOP COUNT.

Iterate through a loop a maximum of ten, minimum of three, and average of five times:

!DIR$ LOOP COUNT MAX(10), MIN(3), AVG(5)
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

Change The Chunk Size

The loop is threaded and vectorized using the !$omp parallel do simd directive, which parallelizes the loop with both threads and SIMD instructions. Specifically, the directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions. In this case, the chunk size (number of iterations per chunk) is not a multiple of vector length. To fix: Add a schedule(simd: [kind]) modifier to the !$omp parallel do simd directive.

Example: Guarantee the chunk size is a multiple of the vector length.

!$omp parallel do simd schedule(simd: static)
do i = 1,1000
    c(i) = a(i)*b(i)
end do
!$omp end parallel do simd

Add Data Padding

The trip count is not a multiple of vector length. To fix: Do one of the following:

  • Increase the size of objects and add iterations so the trip count is a multiple of vector length (see the sketch after this list).
  • Increase the size of static and automatic objects, and use a compiler option to add data padding.
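
A minimal sketch of the first option, assuming a vector length of 8 (the names and sizes are illustrative, and the padded elements must be initialized so the extra iterations do not change your results):

integer, parameter :: n  = 1003                    ! logical problem size
integer, parameter :: vl = 8                       ! assumed vector length
integer, parameter :: np = ((n + vl - 1)/vl)*vl    ! padded size: 1008
real :: a(np), b(np), c(np)

b(n+1:np) = 0.0    ! make the padded iterations harmless
c(n+1:np) = 0.0
do i = 1, np       ! trip count is now a multiple of the vector length
    a(i) = b(i)*c(i)
enddo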

Collect Trip Counts Data

The Survey Report lacks trip counts data that might generate more precise recommendations. To fix: Run a Trip Counts analysis.

Disable Unrolling

The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling or decrease the unroll factor using a directive: !DIR$ NOUNROLL or !DIR$ UNROLL(n).

Disable automatic loop unrolling using !DIR$ NOUNROLL.

!DIR$ NOUNROLL
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

Use A Smaller Vector Length

The compiler chose a vector length, but the trip count might be smaller than that vector length. To fix: Specify a smaller vector length using a directive: !$OMP SIMD SIMDLEN(n).

!$OMP SIMD SIMDLEN(4)
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

In Intel Compiler version 19.0 and higher, the VECTORLENGTH clause accepts a list of lengths, which lets the compiler choose the best vector length based on cost: !DIR$ VECTOR VECTORLENGTH(vl1, vl2, ..., vln), where each vl is an integer power of 2.

!DIR$ VECTOR VECTORLENGTH(2, 4, 16)
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

Disable Dynamic Alignment

The compiler automatically peeled iterations from the vector loop into a scalar loop to align the vector loop with a particular memory reference; however, this optimization may not be ideal. To possibly achieve better performance, disable automatic peel generation using the directive: !DIR$ VECTOR NODYNAMIC_ALIGN.

...
!DIR$ VECTOR NODYNAMIC_ALIGN
do i = 1, len
    a(i) = b(i) * c(i)
enddo

Serialized User Function Call(s) Present

User-defined functions in the loop body are not vectorized.

Enable Inline Expansion

Inlining of user-defined functions is disabled by a compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with 1 to enable inlining only when an inline keyword or attribute is specified, or with 2 to enable inlining of any function at the compiler's discretion.

Windows* OS     Linux* OS
/Ob1 or /Ob2     -inline-level=1 or -inline-level=2

Vectorize Serialized Function(s) Inside Loop

  • Enforce vectorization of the source loop by means of SIMD instructions and/or create a SIMD version of the function(s) using a directive:
    Target Directive
    Source Loop !$OMP SIMD
    Inner function definition or declaration !$OMP DECLARE SIMD
  • If using the Ob or inline-level compiler option to control inline expansion with the 1 argument, use an inline keyword or attribute to enable inlining, or replace the 1 argument with 2 to enable inlining of any function at the compiler's discretion.

real function f (x)
    !$OMP DECLARE SIMD(f)
    real, intent(in), value  :: x
    f = x + 1
end function f

!$OMP SIMD
do k = 1, N
    a(k) = f(k)
enddo

Scalar Math Function Call(s) Present

Math functions in the loop body are preventing the compiler from effectively vectorizing the loop. Improve performance by enabling vectorized math call(s).

Enable Inline Expansion

Inlining is disabled by a compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with 1 to enable inlining only when an inline keyword or attribute is specified, or with 2 to enable inlining of any function at the compiler's discretion.

Windows* OS     Linux* OS
/Ob1 or /Ob2     -inline-level=1 or -inline-level=2

Alternatively, use the #include <mathimf.h> header instead of the standard #include <math.h> header to call highly optimized and accurate mathematical functions commonly used in applications that rely heavily on floating-point computations.

Vectorize Math Function Calls Inside Loops

Your application calls serialized versions of math functions when you use the precise floating point model. To fix: Do one of the following:

  • Add the fast-transcendentals compiler option to replace calls to transcendental functions with faster calls.

    Windows* OS     Linux* OS
    /Qfast-transcendentals     -fast-transcendentals

    CAUTION:
    This may reduce floating point accuracy.
  • Enforce vectorization of the source loop using a directive: !$OMP SIMD
subroutine add(A, N, X)
    integer N, X
    real    A(N)
    !$OMP SIMD
    do i=x+1, n
        a(i) = a(i) + a(i-x)
    enddo
end

Change The Floating Point Model

Your application calls serialized versions of math functions when you use the strict floating point model. To fix: Do one of the following:

  • Use the fast floating-point model to enable more aggressive optimizations, or use the precise floating-point model together with fast transcendental functions to disable the other optimizations that are not value-safe.

    Windows* OS     Linux* OS
    /fp:fast     -fp-model fast
    /fp:precise /Qfast-transcendentals     -fp-model precise -fast-transcendentals

    CAUTION:
    This may reduce floating point accuracy.
  • Use the precise floating point model and enforce vectorization of the source loop using a directive: !$OMP SIMD

ifort program.for -O2 -fopenmp -fp-model precise -fast-transcendentals

!$OMP SIMD
do i = 1, N
    a(i) = b(i) * c(i)
    do j = 1, N
        d(j) = e(j) * f(j)
    enddo
enddo

Use a Glibc Library with Vectorized SVML Functions

Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:

  • Upgrade the Glibc library to version 2.22 or higher. It supports SIMD directives in OpenMP* 4.0 or higher.
  • Upgrade the GNU* gcc compiler to version 4.9 or higher. It supports vectorized math function options.
  • Use the -fopenmp and -ffast-math compiler options to enable vector math functions.
  • Use appropriate OpenMP SIMD directives to enable vectorization.
NOTE:
Also use the -I/path/to/glibc/install/include and -L/path/to/glibc/install/lib compiler options if you have multiple Glibc libraries installed on the host.
gfortran PROGRAM.FOR -O2 -fopenmp -ffast-math -lrt -lm -mavx2
program main
    parameter (N=100000000)
    real*8 angles(N), results(N)
    integer i
    call srand(86456)

    do i=1,N
        angles(i) = rand()
    enddo

    !$OMP SIMD
    do i=1,N
        results(i) = cos(angles(i))
    enddo

end

Use The Intel Short Vector Math Library for Vector Intrinsics

Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:

  • Use the -mveclibabi=svml compiler option to specify the Intel short vector math library ABI type for vector intrinsics.
  • Use the -ftree-vectorize and -funsafe-math-optimizations compiler options to enable vector math functions.
  • Use the -L/path/to/intel/lib and -lsvml compiler options to specify an SVML ABI-compatible library at link time.
gfortran PROGRAM.FOR -O2 -ftree-vectorize -funsafe-math-optimizations -mveclibabi=svml -L/opt/intel/lib/intel64 -lm -lsvml -Wl,-rpath=/opt/intel/lib/intel64
program main
    parameter (N=100000000)
    real*8 angles(N), results(N)
    integer i
    call srand(86456)

    do i=1,N
        angles(i) = rand()
    enddo

    ! the loop will be auto-vectorized
    do i=1,N
        results(i) = cos(angles(i))
    enddo

end

Inefficient Gather/Scatter Instructions Present

The compiler assumes indirect or irregular stride access to data used for vector operations. Improve memory access by alerting the compiler to detected regular stride access patterns, such as:

Pattern     Description
Invariant     The instruction accesses values in the same memory throughout the loop.
Uniform (Horizontal Invariant)     The instruction accesses values in the same memory within the vector iteration.
Vertical Invariant     The instruction accesses the memory locations using the same offset across all vector iterations.
Unit     The instruction accesses values in contiguous memory throughout the loop, and the stride between vector iterations equals the vector length.
Constant (Non-Unit)     The instruction accesses the memory locations using the same stride between iterations.

Refactor code with detected regular stride access patterns

The Memory Access Patterns Report shows the following regular stride access(es):

Variable     Pattern
block 0x2b05c877040 allocated at main.cpp:14     Unit
block 0x2b05c877040 allocated at main.cpp:14     Constant (Non-Unit)

See details in the Memory Access Patterns Report Source Details view.

To improve memory access: Refactor your code to alert the compiler to a regular stride access. Sometimes it might also be beneficial to use the /Qipo (Windows* OS) or -ipo (Linux* OS) compiler option to enable interprocedural optimization (IPO) between files.

Vector Register Spilling Possible

Possible register spilling was detected and all vector registers are in use. This may negatively impact performance, because spilled variables must be stored to and reloaded from memory. Improve performance by decreasing vector register pressure.

Decrease Unroll Factor

The current directive unroll factor increases vector register pressure. To fix: Decrease the unroll factor using a directive: !DIR$ NOUNROLL or !DIR$ UNROLL(n).

!DIR$ UNROLL(2) ! a smaller unroll factor (illustrative)
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

Split Loop into Smaller Loops

Possible register spilling along with high vector register pressure is preventing effective vectorization. To fix: Use the directive !DIR$ DISTRIBUTE POINT or rewrite your code to distribute the source loop. This can decrease register pressure as well as enable software pipelining and improve both instruction and data cache use.

!DIR$ DISTRIBUTE POINT
do i = 1, m
    b(i) = a(i) + 1
    ...
    c(i) = a(i) + b(i) ! Compiler will decide
    ! where to distribute.
    ! Data dependencies are observed
    ...
    d(i) = c(i) + 1
enddo
do i = 1, m
    b(i) = a(i) + 1
    ...
    !DIR$ DISTRIBUTE POINT
    call sub(a, n) ! Distribution will start here,
    ! ignoring all loop-carried dependencies
    c(i) = a(i) + b(i)
    ...
    d(i) = c(i) + 1
enddo

Assumed Dependency Present

The compiler assumed there is an anti-dependency (Write after read - WAR) or a true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.

Confirm Dependency Is Real

There is no confirmation that a real (proven) dependency is present in the loop. To confirm: Run a Dependencies analysis.

Enable Vectorization

The Dependencies analysis shows there is no real dependency in the loop for the given workload. Tell the compiler it is safe to vectorize using a directive:

Directive     Description
!$OMP SIMD     Ignores all dependencies in the loop.
!DIR$ IVDEP     Ignores only vector dependencies (which is the safest).

!DIR$ IVDEP
do i = 1, N-4, 4
    a(i+4) = b(i) * c
enddo

Proven (Real) Dependency Is Present

The compiler assumed there is an anti-dependency (Write after read - WAR) or true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.

Resolve Dependency

The Dependencies analysis shows there is a real (proven) dependency in the loop. To fix: Do one of the following:

  • If there is an anti-dependency, enable vectorization using the directive !$OMP SIMD SAFELEN(length), where length is smaller than the distance between dependent iterations in the anti-dependency.
    !$OMP SIMD SAFELEN(4)
    do i = 1, N-4, 4
        a(i+4) = b(i) * c
    enddo
  • If there is a reduction pattern dependency in the loop, enable vectorization using the directive !$OMP SIMD REDUCTION(operator:list).
    !$OMP SIMD REDUCTION(+:SUMX)
    do k = 1, size2
        sumx = sumx + x(k) * b(k)
    enddo
  • Rewrite the code to remove the dependency. Use programming techniques such as variable privatization.

Data Type Conversions Present

There are multiple data types within loops. Utilize hardware vectorization support more effectively by avoiding data type conversion.

Use The Smallest Data Type

The source loop contains data types of different widths. To fix: Use the smallest data type that gives the needed precision to use the entire vector register width.

Example: If only 16 bits are needed, using a 16-bit integer such as INTEGER(2) rather than a default 32-bit INTEGER can make the difference between eight-way and four-way SIMD parallelism, respectively.
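
A minimal Fortran sketch of this idea (kinds and names are illustrative); with 128-bit vector registers, 16-bit elements give eight SIMD lanes while 32-bit elements give only four:

integer(kind=2) :: s(1000)   ! 16-bit elements: eight lanes per 128-bit register

do i = 1, 1000
    s(i) = s(i) + 1_2        ! 1_2 is a 16-bit literal; no widening conversion
enddo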

User Function Call(s) Present

User-defined functions in the loop body are preventing the compiler from vectorizing the loop.

Enable Inline Expansion

Inlining of user-defined functions is disabled by a compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with 1 to enable inlining only when an inline keyword or attribute is specified, or with 2 to enable inlining of any function at the compiler's discretion.

Windows* OS     Linux* OS
/Ob1 or /Ob2     -inline-level=1 or -inline-level=2

Vectorize User Function(s) Inside Loop

These user-defined function(s) are not vectorized or inlined by the compiler: my_calc(). To fix: Do one of the following:

  • Enforce vectorization of the source loop by means of SIMD instructions and/or create a SIMD version of the function(s) using a directive:
    Target Directive
    Source loop !$OMP SIMD
    Inner function definition or declaration !$OMP DECLARE SIMD
real function f (x)
    !$OMP DECLARE SIMD(f)
    real, intent(in), value  :: x
    f = x + 1
end function f

!$OMP SIMD
do k = 1, N
    a(k) = f(k)
enddo

Convert to Fortran SIMD-Enabled Functions

Passing an array or array section to an ELEMENTAL function or subroutine creates a dependency that prevents vectorization. To fix:

  • Enforce vectorization of the source loop using SIMD instructions and/or create a SIMD version of the function(s) using a directive:
    Target Directive
    Source loop !$OMP SIMD
    Inner function definition or declaration !$OMP DECLARE SIMD
  • Call the function from inside a DO loop, as shown in the revised example below.

Original code example:

elemental subroutine callee(t,q,r)
    real, intent(in) :: t, q
    real, intent(out) :: r
    r = t + q
end subroutine callee
...
do k = 1,nlev
    call callee(a(:,k), b(:,k), c(:,k))
end do
... 

Revised code example:

subroutine callee(t,q,r)
    !$OMP DECLARE SIMD(callee)
    real, intent(in) :: t, q
    real, intent(out) :: r
    r = t + q
end subroutine callee
...
do k = 1,nlev
    !$OMP SIMD
    do i = 1,n
        call callee(a(i,k), b(i,k), c(i,k))
    end do
end do
... 

Compiler Lacks Sufficient Information to Vectorize Loop

Cause: You are using a non-Intel compiler or an outdated Intel compiler. Nevertheless, it appears there are no issues preventing vectorization and vectorization may be profitable.

Explore Vectorization Opportunities

You compiled with auto-vectorization enabled; however, the compiler did not vectorize the code. Explore vectorization opportunities:

  • Run a Dependencies analysis to identify real data dependencies that could make forced vectorization unsafe.
  • With the Intel compiler, increase the auto-vectorizer reporting level to output missed optimization opportunities.
  • With the GNU* Fortran compiler, do one of the following:
    • Use the fopt-info-vec-missed compiler option to output missed optimization opportunities.
    • Use the OpenMP* omp simd directive to tell the compiler it is safe to vectorize.
    • Use additional auto-vectorization directives.

Enable Auto-Vectorization

You compiled with auto-vectorization disabled; enable auto-vectorization:

  • Intel compiler 14.x or below: Increase the optimization level to O2 or O3.
  • With the GNU* Fortran compiler, do one of the following:
    • Increase the optimization level to O2 or O3.
    • Use the ftree-vectorize compiler option.

System Function Call(s) Present

System function call(s) in the loop body are preventing the compiler from vectorizing the loop.

Remove System Function Call(s) Inside Loop

Typically system function or subroutine calls cannot be vectorized; even a print statement is sufficient to prevent vectorization. To fix: Avoid using system function calls in loops.
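
For example, a minimal sketch (names are illustrative) that hoists the I/O out of the compute loop so the compute loop can vectorize:

! Original: the PRINT statement prevents vectorization
do i = 1, n
    a(i) = 2.0*a(i)
    print *, a(i)
enddo

! Revised: compute first, then do the I/O separately
do i = 1, n
    a(i) = 2.0*a(i)
enddo
print *, a(1:n)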

OpenMP* Function Call(s) Present

OpenMP* function call(s) in the loop body are preventing the compiler from effectively vectorizing the loop.

Move OpenMP Call(s) Outside The Loop Body

OpenMP calls prevent automatic vectorization when the compiler cannot move the calls outside the loop body, such as when OpenMP calls are not invariant. To fix:

  1. Split the OpenMP parallel loop directive into two directives.
     Target     Directive
     Outer     !$OMP PARALLEL [clause[[,] clause] ... ]
     Inner     !$OMP DO [clause[[,] clause] ... ]
  2. Move the OpenMP calls outside the loop when possible.

Original code example:

!$OMP PARALLEL DO PRIVATE(tid, nthreads)
do k = 1, N
    tid = omp_get_thread_num() ! this call inside loop prevents vectorization
    nthreads = omp_get_num_threads() ! this call inside loop prevents vectorization
    ...
enddo

Revised code example:

!$OMP PARALLEL PRIVATE(tid, nthreads)
! Move OpenMP calls here
tid = omp_get_thread_num()
nthreads = omp_get_num_threads()

!$OMP DO NOWAIT
do k = 1, N
    ...
enddo
!$OMP END PARALLEL

Remove OpenMP Lock Functions

Locking objects slows loop execution. To fix: Rewrite the code without OpenMP lock functions.

Allocating separate arrays for each thread and then merging them after the parallel region may improve speed (but consume more memory), as in the sketch below.
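
A minimal sketch of that idea (hist, bin, and lck are hypothetical); the REDUCTION clause gives each thread a private copy of the array and merges the copies when the region ends:

! Original: a lock serializes every update
!$OMP PARALLEL DO
do i = 1, n
    call omp_set_lock(lck)
    hist(bin(i)) = hist(bin(i)) + 1
    call omp_unset_lock(lck)
enddo
!$OMP END PARALLEL DO

! Revised: per-thread copies merged at the end, no lock needed
!$OMP PARALLEL DO REDUCTION(+:hist)
do i = 1, n
    hist(bin(i)) = hist(bin(i)) + 1
enddo
!$OMP END PARALLEL DO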

Potential Inefficient Memory Access Patterns Present

Inefficient memory access patterns may result in significant vector code execution slowdown or block automatic vectorization by the compiler. Improve performance by investigating.

Confirm Inefficient Memory Access Patterns

There is no confirmation inefficient memory access patterns are present. To fix: Run a Memory Access Patterns analysis.

Inefficient Memory Access Patterns Present

There is a high percentage of memory instructions with irregular (variable or random) stride accesses. Improve performance by investigating and handling accordingly.

Reorder Loops

This loop has less efficient memory access patterns than a nearby outer loop. To fix: Reorder the loops if possible.

Original code example:

subroutine matrix_multiply(arrSize, aMatrix, bMatrix, cMatrix)
  implicit none
  real, intent(inout) :: cMatrix(:,:)
  real, intent(in)    :: aMatrix(:,:), bMatrix(:,:)
  integer, intent(in) :: arrSize
  integer :: i, j, k

  do j=1,arrSize
    do i=1,arrSize
      do k=1,arrSize
        cMatrix(i,j) = cMatrix(i,j) + aMatrix(i,k) * bMatrix(k,j)
      end do
    end do
  end do

end subroutine matrix_multiply

Revised code example:

subroutine matrix_multiply(arrSize, aMatrix, bMatrix, cMatrix)
  implicit none
  real, intent(inout) :: cMatrix(:,:)
  real, intent(in)    :: aMatrix(:,:), bMatrix(:,:)
  integer, intent(in) :: arrSize
  integer :: i, j, k

  do j=1,arrSize
    do k=1,arrSize
      do i=1,arrSize
        cMatrix(i,j) = cMatrix(i,j) + aMatrix(i,k) * bMatrix(k,j)
      end do
    end do
  end do

end subroutine matrix_multiply

Interchanging loops is not always possible: dependencies may make the interchanged loops produce different results.

Use the Fortran 2008 CONTIGUOUS Attribute

The loop is multi-versioned for unit and non-unit strides in assumed-shape arrays or pointers, but only the marked versions of the loop have unit-stride access. The CONTIGUOUS attribute specifies that the target of a pointer or an assumed-shape array is contiguous. It can make it easier to enable optimizations that rely on the memory layout of an object occupying a contiguous block of memory.

real, pointer, contiguous :: ptr(:)
real, contiguous :: arrayarg(:, :)

When multiple calling routines are involved, to tell the compiler assumed-shape arrays and/or pointers are always contiguous in memory, use the following options available in Version 18 and higher of the Intel® Fortran Compiler:

Type Windows* OS Linux* OS
assumed-shape array /assume:contiguous_assumed_shape -assume contiguous_assumed_shape
pointer /assume:contiguous_pointer -assume contiguous_pointer
NOTE:
Results are indeterminate and could result in incorrect code and segmentation faults if the user assertion is wrong and the data is not contiguous at runtime. To check at runtime if targets of contiguous pointer assignments are indeed contiguous in memory, use the following options available in Version 18 and higher of the Intel® Fortran Compiler:
Windows OS Linux OS
/check:contiguous -check contiguous

$ ifort -DCONTIG -check contiguous -traceback

forrtl: severe (408): fort: (32): A pointer with the CONTIGUOUS attributes is being made to a non-contiguous target.
In this example, the compiler detects the assignment of a contiguous pointer to a non-contiguous target. The -traceback (Linux* OS) / /traceback (Windows* OS) option identifies the function and the source file line number at which the incorrect assignment occurs. It is not necessary to compile with the debugging option -g (Linux* OS and macOS*) / /Zi (Windows* OS) to get this traceback.

Use SoA Instead of AoS

An array is the most common type of data structure containing a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can hinder effective vector processing. To fix: Rewrite code to organize data using SoA instead of AoS.
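
A minimal Fortran sketch (type and field names are illustrative):

! AoS: the fields of each element are interleaved, so reading all
! x values is a strided access
type particle
    real :: x, y, z
end type particle
type(particle) :: p(1000)

! SoA: each field is its own contiguous array, so reading all
! x values is a unit-stride access that vectorizes well
type particle_set
    real :: x(1000), y(1000), z(1000)
end type particle_set
type(particle_set) :: ps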

Potential Underutilization of FMA Instructions

Your current hardware supports the AVX2 instruction set architecture (ISA), which enables the use of fused multiply-add (FMA) instructions. Improve performance by utilizing FMA instructions.

Force Vectorization If Possible

The loop contains FMA instructions (so vectorization could be beneficial), but is not vectorized. To fix, review:

  • Corresponding compiler diagnostic to check if vectorization enforcement is possible and profitable
  • The Dependencies analysis to distinguish between compiler-assumed dependencies and real dependencies

Explicitly Enable FMA Generation When Using The Strict Floating-Point Model

Static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, but the strict floating-point model disables FMA instruction generation by default. To fix: Override this behavior using the fma compiler option.

Windows OS Linux OS
/Qfma -fma

Target The AVX2 ISA

Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 or higher ISA, no FMA instructions executed for this loop. To fix: Use the following compiler options:

  • xCORE-AVX2 to compile for machines with AVX2 support only
  • axCORE-AVX2 to compile for machines with and without AVX2 support
  • xCOMMON-AVX512 to compile for machines with AVX-512 support only
  • axCOMMON-AVX512 to compile for machines with and without AVX-512 support

NOTE:
The compiler options may vary depending on the CPU microarchitecture.
Windows OS Linux OS
/QxCORE-AVX2 or /QaxCORE-AVX2 -xCORE-AVX2 or -axCORE-AVX2
/QxCOMMON-AVX512 or /QaxCOMMON-AVX512 -xCOMMON-AVX512 or -axCOMMON-AVX512

Target A Specific ISA Instead of Using The xHost Option

Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 or higher ISA, no FMA instructions executed for this loop. To fix: Instead of using the xHost compiler option, which limits optimization opportunities by the host ISA, use the following compiler options:

  • xCORE-AVX2 to compile for machines with AVX2 support only
  • axCORE-AVX2 to compile for machines with and without AVX2 support
  • xCOMMON-AVX512 to compile for machines with AVX-512 support only
  • axCOMMON-AVX512 to compile for machines with and without AVX-512 support

NOTE:
The compiler options may vary depending on the CPU microarchitecture.
Windows OS Linux OS
/QxCORE-AVX2 or /QaxCORE-AVX2 -xCORE-AVX2 or -axCORE-AVX2
/QxCOMMON-AVX512 or /QaxCOMMON-AVX512 -xCOMMON-AVX512 or -axCOMMON-AVX512

Indirect Function Call(s) Present

Indirect function call(s) in the loop body are preventing the compiler from vectorizing the loop. Indirect calls, sometimes called indirect jumps, get the callee address from a register or memory; direct calls get the callee address from an argument. Even if you force loop vectorization, indirect calls remain serialized.

Improve Branch Prediction

For 64-bit applications, branch prediction performance can be negatively impacted when the branch target is more than 4 GB away from the branch. This is more likely to happen when the application is split into shared libraries. To fix: Do the following:

  • Upgrade the Glibc library to version 2.23 or higher.
  • Set the environment variable: export LD_PREFER_MAP_32BIT_EXEC=1

Remove Indirect Call(s) Inside The Loop

Indirect function or subroutine calls cannot be vectorized. To fix: Avoid using indirect calls in loops.
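
For example, a hedged sketch (names are hypothetical): a call through a procedure pointer is indirect, while naming the procedure makes the call direct and potentially inlinable:

! Original: indirect call through a procedure pointer
procedure(real_func), pointer :: fp
fp => my_func
do i = 1, n
    a(i) = fp(b(i))
enddo

! Revised: direct call to the specific procedure
do i = 1, n
    a(i) = my_func(b(i))
enddo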

Inefficient Processing of SIMD-enabled Functions Possible

Vector declaration defaults for your SIMD-enabled functions may result in extra computations or ineffective memory access patterns. Improve performance by overriding defaults.

Specify the Value of the Underlying Reference as Linear

In Fortran applications, by default, scalar arguments are passed by reference. Therefore, in SIMD-enabled functions, arguments are passed as a short vector of addresses instead of a single address. The compiler then gathers data from the vector of addresses to create a short vector of values for use in subsequent vector arithmetic. This gather activity negatively impacts performance. To fix: Add a LINEAR clause with a REF modifier (introduced in OpenMP* 4.5) to your vector declaration. Specifically, add LINEAR(REF(linear-list[: linear-step])) to your !$OMP DECLARE SIMD directive.
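
A minimal sketch based on the DECLARE SIMD examples above (the function and argument names are illustrative):

real function f (x)
    !$OMP DECLARE SIMD(f) LINEAR(REF(x))
    real, intent(in) :: x   ! passed by reference; REF(x) tells the
                            ! compiler the addresses form a unit-stride sequence
    f = x + 1
end function f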

Target a Specific Processor Type(s)

The default instruction set architecture (ISA) for SIMD-enabled functions is inefficient for your host processor because it can result in extra data movement between registers and memory. To fix: Add one of the following to tell the compiler to generate an extended set of vector functions.

Windows OS     Linux OS
PROCESSOR(cpuid) added to !$OMP DECLARE SIMD     PROCESSOR(cpuid) added to !$OMP DECLARE SIMD
/Qvecabi:cmdtarget     -vecabi=cmdtarget

NOTE:
With /Qvecabi:cmdtarget (-vecabi=cmdtarget), vector variants are created for the targets specified by the /Qx or /Qax (-x or -ax) compiler options.
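
For example, a hedged sketch (the cpuid value core_4th_gen_avx is illustrative; consult your compiler documentation for the supported values):

real function f (x)
    !$OMP DECLARE SIMD(f) PROCESSOR(core_4th_gen_avx)
    real, intent(in), value :: x
    f = x + 1
end function f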

Enforce the Compiler to Ignore Assumed Vector Dependencies

No real dependencies were detected, so there is no need for conflict-detection instructions. To fix: Tell the compiler it is safe to vectorize using a directive: !DIR$ IVDEP.

NOTE:
This fix may be unsafe in other scenarios; use with care to avoid incorrect results.

!DIR$ IVDEP
do i = 1, N
    a(index(i)) = b(i) * c
enddo

Opportunity for Outer Loop Vectorization

This is an outer (non-innermost) loop. Normally, outer loops are not targeted by auto-vectorization. Outer loop vectorization is possible and sometimes more profitable, but it requires explicit vectorization using the OpenMP* API or Intel® Cilk™ Plus.

Collect Trip Counts Data

The Survey Report lacks trip counts data that might prove profitability for outer loop vectorization. To fix: Run a Trip Counts analysis.

Check Dependencies for Outer Loop

It is not safe to force outer loop vectorization without knowing that there are no dependencies. Disable inner loop vectorization before checking for dependencies. To check: Run a Dependencies analysis.

Check Memory Access Patterns for Outer Loop

To ensure that the outer loop has optimal memory access patterns, run a Memory Access Patterns analysis.

Consider Outer Loop Vectorization

The compiler targets only innermost loops, so it vectorized the inner loop but did not vectorize the outer loop. However, outer loop vectorization could be more profitable because of a better memory access pattern, higher trip counts, or a better dependencies profile.

To enforce outer loop vectorization:

Target Directive
Outer Loop !$OMP SIMD
Inner Loop !DIR$ NOVECTOR

!$OMP SIMD
DO I=1,N
    !DIR$ NOVECTOR
    DO J=1,N
        SUM = SUM + A(i)*A(j)
    ENDDO
ENDDO

Consider Outer Loop Vectorization

The compiler did not vectorize the loop because the code exceeds the compiler's complexity criteria. You might get higher performance if you enforce loop vectorization. Use a directive right before your loop block in the source code.

Intel® Compiler Directive
!$OMP SIMD

Consider Outer Loop Vectorization

The compiler did not vectorize the inner loop due to potential dependencies detected. You might vectorize the outer loop if it has no dependencies. Use a directive right before your loop block in the source code.

Intel® Compiler Directive
!$OMP SIMD

Potential Underutilization of Approximate Reciprocal Instructions

Your current hardware supports Advanced Vector Extensions 512 (AVX-512) instructions that enable the use of approximate reciprocal and reciprocal square root instructions both for single- and double-precision floating-point calculations. Improve performance by utilizing these instructions.

Force Vectorization If Possible

The loop contains SQRT/DIV instructions (so vectorization could be beneficial), but is not vectorized. To fix, review:

  • Corresponding compiler diagnostic to check if vectorization enforcement is possible and profitable
  • The Dependencies analysis to distinguish between compiler-assumed dependencies and real dependencies

Target the AVX-512 ISA

Static analysis presumes the loop may benefit from AVX-512 approximate reciprocal instructions, but these instructions were not used. To fix: Use one of the following compiler options:

  • xCOMMON-AVX512 - tells the compiler which processor features to target, including the instruction sets and optimizations it may generate, up to and including AVX-512.
  • axCOMMON-AVX512 - tells the compiler to generate multiple feature-specific auto-dispatch code paths for Intel processors if there is a performance benefit.

Windows OS Linux OS
/QxCOMMON-AVX512 or /QaxCOMMON-AVX512 -xCOMMON-AVX512 or -axCOMMON-AVX512

Target the AVX-512 Exponential and Reciprocal Instructions ISA

Static analysis presumes the loop may benefit from AVX-512 Exponential and Reciprocal (AVX-512ER) instructions currently supported only on Intel® Xeon Phi™ processors, but these instructions were not used. To fix: Use one of the following compiler options:

  • xMIC-AVX512 - tells the compiler which processor features to target, including the instruction sets and optimizations it may generate, up to and including AVX-512ER.
  • axMIC-AVX512 - tells the compiler to generate multiple feature-specific auto-dispatch code paths for Intel processors if there is a performance benefit.

Windows OS Linux OS
/QxMIC-AVX512 or /QaxMIC-AVX512 -xMIC-AVX512 or -axMIC-AVX512

Enable the Use of Approximate Reciprocal Instructions by Fine-Tuning Precision and Floating-Point Model Compiler Options

Static analysis presumes the loop may benefit from using approximate reciprocal instructions, but the precision and floating-point model settings may prevent the compiler from using these instructions. To fix: Fine-tune your usage of the following compiler options:

Windows OS Linux OS Comment
/fp -fp-model -fp-model=precise prevents the use of approximate reciprocal instructions.
/Qimf-precision -fimf-precision Consider using -fimf-precision=medium or -fimf-precision=low.
/Qimf-accuracy-bits -fimf-accuracy-bits Consider decreasing this setting.
/Qimf-max-error -fimf-max-error Consider increasing this setting. There is a similar option, -fimf-absolute-error; avoid using both options at the same time or tune them together.
/Qimf-absolute-error -fimf-absolute-error Consider using -fimf-max-error instead and set -fimf-absolute-error=0 (the default), or increase this setting together with -fimf-max-error.
/Qimf-domain-exclusion -fimf-domain-exclusion Consider increasing this setting. More excluded classes enable more optimized code. USE WITH CAUTION: this option may cause incorrect behavior if your calculations involve excluded domains.
/Qimf-arch-consistency -fimf-arch-consistency -fimf-arch-consistency=true may prevent the use of approximate reciprocal instructions.
/Qprec-div -prec-div -prec-div prevents the use of approximate reciprocal instructions.
/Qprec-sqrt -prec-sqrt -prec-sqrt prevents the use of approximate reciprocal instructions.

Possible Inefficient Conflict-Detection Instructions Present

Stores with indirect addressing caused the compiler to assume a potential dependency.

This resulted in the use of conflict-detection instructions during SIMD processing, such as the AVX-512 vpconflict instruction, which detects duplicate values within a vector and creates conflict-free subsets. Improve performance by removing the need for conflict-detection instructions.

Enforce the Compiler to Ignore Assumed Vector Dependencies

No real dependencies were detected, so there is no need for conflict-detection instructions. To fix: Tell the compiler it is safe to vectorize using a directive: !DIR$ IVDEP.

NOTE:
This fix may be unsafe in other scenarios; use with care to avoid incorrect results.

!DIR$ IVDEP
do i = 1, N
    a(index(i)) = b(i) * c
enddo

Unoptimized Floating-Point Operation Processing Possible

Improve performance by enabling approximate-operation instructions.

Enable the Use of Approximate Division Instructions

Static analysis presumes the loop may benefit from using approximate calculations: independent divisors will be pre-calculated and replaced by multiplication with their reciprocals. To fix: Fine-tune your usage of the following compiler option:

Windows OS Linux OS Comment
/Qprec-div- -no-prec-div -no-prec-div (/Qprec-div-) enables the use of approximate division optimizations.

Enable the Use of Approximate sqrt Instructions

Static analysis presumes the loop may benefit from using approximate sqrt instructions, but the precision and floating-point model settings may prevent the compiler from using these instructions. To fix: Fine-tune your usage of the following compiler option:

Windows OS Linux OS Comment
/Qprec-sqrt- -no-prec-sqrt -no-prec-sqrt (/Qprec-sqrt-) enables the use of approximate sqrt optimizations.

Potential Excessive Caching Present

Enable Non-Temporal Store

Enable non-temporal stores using the !DIR$ VECTOR NONTEMPORAL directive. The NONTEMPORAL clause directs the compiler to use non-temporal (that is, streaming) stores; it optionally takes a comma-separated list of variables.

Streaming stores may yield significant performance improvements over non-streaming stores for large amounts of data on certain processors. However, the misuse of streaming stores can significantly degrade performance.

!DIR$ VECTOR NONTEMPORAL
do i=1,N
  arr1(i) = 0
end do

Misaligned Loop Code Present

Current placement of the loop in memory may result in inefficient use of the CPU front-end. Improve performance by aligning loop code.

Force the Compiler to Align Loop Code

CAUTION:
Excessive code alignment may increase application binary size and decrease performance.

Static analysis shows the loop may benefit from code alignment. To fix: Force the compiler to align the loop to a power-of-two byte boundary using a compiler directive for finer-grained control: !DIR$ CODE_ALIGN [:n]

Align the loop to a 64-byte boundary:

!DIR$ CODE_ALIGN :64
do i = 1, n, 1
    do j = 1, m, 1
        a(i) = a(i) * (b(i) + c(j))
    enddo
enddo

You may also need the following compiler option:

Windows OS Linux OS and Mac OS
/Qalign-loops[:n] -falign-loops[=n]

where n is a power of 2 between 1 and 4096, such as 1, 2, 4, 8, 16, or 32. n = 1 performs no alignment. If n is not present, the compiler uses an alignment of 16 bytes. Suggestion: Try 16 and 32 first.

The default options, /Qalign-loops- and -fno-align-loops, disable special loop alignment.