Intel® Advisor User Guide

ID 766448
Date 11/07/2023
Public

Vectorization Recommendations for Fortran

Ineffective Peeled/Remainder Loop(s) Present

All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.
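
To make the split between peeled, body, and remainder iterations concrete, the following minimal sketch works through the arithmetic. The trip count, vector length, and peel count are hypothetical values chosen for illustration; the actual split is decided by the compiler.

```fortran
program loop_split
    implicit none
    integer, parameter :: n    = 1000  ! hypothetical source loop trip count
    integer, parameter :: vl   = 8     ! hypothetical vector length (elements per vector)
    integer, parameter :: peel = 3     ! hypothetical iterations peeled for alignment
    integer :: body, remainder

    ! Iterations handled by the vectorized loop body: the largest
    ! multiple of vl that fits after the peeled iterations.
    body = ((n - peel) / vl) * vl

    ! Whatever is left over executes in the remainder loop.
    remainder = (n - peel) - body

    ! Sanity check: the three parts must cover the whole trip count.
    if (peel + body + remainder /= n) error stop 'split does not cover the trip count'

    print '(a,i0)', 'peeled iterations:    ', peel
    print '(a,i0)', 'body iterations:      ', body
    print '(a,i0)', 'remainder iterations: ', remainder
end program loop_split
```

With these assumed values, 8 of the 1000 iterations (3 peeled and 5 remainder) run outside the vectorized loop body. The recommendations below aim to shrink that share or make it cheaper.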

Align Data

One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned. To align data in Fortran, use the !DIR$ ATTRIBUTES ALIGN directive. To tell the compiler the data is aligned, use the !DIR$ ASSUME_ALIGNED directive before the source loop.
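
A minimal Fortran sketch of this fix follows. It assumes the Intel Fortran directives !DIR$ ATTRIBUTES ALIGN and !DIR$ ASSUME_ALIGNED and a 64-byte boundary; the array names, the array size, and the alignment value are illustrative choices, not values reported by Intel Advisor. Other compilers treat the directives as comments.

```fortran
program align_demo
    implicit none
    integer, parameter :: n = 1024     ! hypothetical array size
    real    :: a(n), b(n), c(n)
    integer :: i
    ! Request 64-byte alignment for the arrays (assumed value).
    !DIR$ ATTRIBUTES ALIGN: 64 :: a, b, c

    a = 1.0
    b = 2.0

    ! Tell the compiler the arrays are aligned before the source loop.
    !DIR$ ASSUME_ALIGNED a: 64, b: 64, c: 64
    do i = 1, n
        c(i) = a(i) * b(i)
    end do

    if (any(c /= 2.0)) error stop 'unexpected result'
    print *, 'c(1) =', c(1)
end program align_demo
```

The alignment assertion is a promise to the compiler: if the data is not actually aligned as stated, the program may fail at run time, so the two directives should always be applied together.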

Parallelize The Loop with Both Threads and SIMD Instructions

The loop is threaded and auto-vectorized; however, the trip count is not a multiple of vector length. To fix: Do all of the following:

  • Use the !$omp parallel do simd directive to parallelize the loop with both threads and SIMD instructions. Specifically, this directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions.
  • Add the schedule(simd: [kind]) modifier to the directive to guarantee the chunk size (number of iterations per chunk) is a multiple of vector length.

Original code sample:

!$omp parallel do schedule(static)
do i = 1,1000
    c(i) = a(i)*b(i)
end do
!$omp end parallel do

Revised code sample:

!$omp parallel do simd schedule(simd: static)
do i = 1,1000
    c(i) = a(i)*b(i)
end do
!$omp end parallel do simd

Force Scalar Remainder Generation

The compiler generated a masked vectorized remainder loop that contains too few iterations for efficient vector processing. A scalar loop may be more beneficial. To fix: Force scalar remainder generation using a directive: !DIR$ VECTOR NOVECREMAINDER.

subroutine add(A, N, X)
    integer N, X
    real    A(N)
    ! Force the compiler to not vectorize the remainder loop
    !DIR$ VECTOR NOVECREMAINDER
    do i=x+1, n
        a(i) = a(i) + a(i-x)
    enddo
end

Force Vectorized Remainder

The compiler did not vectorize the remainder loop, even though doing so could improve performance. To fix: Force vectorization using a directive: !DIR$ VECTOR VECREMAINDER.

subroutine add(A, N, X)
    integer N, X
    real    A(N)
    ! Force the compiler to vectorize the remainder
    !DIR$ VECTOR VECREMAINDER
    do i=x+1, n
        a(i) = a(i) + a(i-x)
    enddo
end

Specify The Expected Loop Trip Count

The compiler cannot detect the trip count statically. To fix: Specify the expected number of iterations using a directive: !DIR$ LOOP COUNT.

Iterate through a loop a maximum of ten, minimum of three, and average of five times:

!DIR$ LOOP COUNT MAX(10), MIN(3), AVG(5)
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

Change The Chunk Size

The loop is threaded and vectorized using the !$omp parallel do simd directive, which parallelizes the loop with both threads and SIMD instructions. Specifically, the directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions. In this case, the chunk size (number of iterations per chunk) is not a multiple of vector length. To fix: Add a schedule(simd: [kind]) modifier to the !$omp parallel do simd directive.

Guarantee that the chunk size is a multiple of the vector length:

!$omp parallel do simd schedule(simd: static)
do i = 1,1000
    c(i) = a(i)*b(i)
end do
!$omp end parallel do simd
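
The effect of the simd modifier can be illustrated with a small arithmetic sketch. It assumes the rounding rule described in the OpenMP specification, where the chunk size is rounded up to the next multiple of an implementation-defined SIMD width; the requested chunk size of 30 and the width of 8 are hypothetical values.

```fortran
program simd_chunk
    implicit none
    integer, parameter :: chunk_size = 30  ! hypothetical requested chunk size
    integer, parameter :: simd_width = 8   ! hypothetical vector length
    integer :: new_chunk

    ! Round the chunk size up to the next multiple of the SIMD width:
    ! new_chunk = ceiling(chunk_size / simd_width) * simd_width
    new_chunk = ((chunk_size + simd_width - 1) / simd_width) * simd_width

    if (mod(new_chunk, simd_width) /= 0) error stop 'chunk is not a multiple of the width'
    print '(a,i0)', 'chunk size with simd modifier: ', new_chunk
end program simd_chunk
```

Under these assumptions, a requested chunk of 30 iterations becomes 32, so each thread processes whole vectors rather than ending its chunk with a partial vector.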

Add Data Padding

The trip count is not a multiple of vector length. To fix: Do one of the following:

  • Increase the size of objects and add iterations so the trip count is a multiple of vector length.
  • Increase the size of static and automatic objects, and use a compiler option to add data padding.
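
The first option can be sketched as follows. The original trip count of 1003 and the vector length of 8 are hypothetical; the padded elements are initialized to zero so the extra iterations compute harmless values that the program never reads back.

```fortran
program pad_demo
    implicit none
    integer, parameter :: n    = 1003  ! hypothetical original trip count
    integer, parameter :: vl   = 8     ! hypothetical vector length
    ! Round the array size up to the next multiple of the vector length.
    integer, parameter :: npad = ((n + vl - 1) / vl) * vl
    real    :: a(npad), b(npad), c(npad)
    integer :: i

    ! Zero everything first so the padding holds defined values.
    a = 0.0
    b = 0.0
    a(1:n) = 2.0
    b(1:n) = 3.0

    ! Iterate over the padded size: the trip count is now a multiple of vl.
    do i = 1, npad
        c(i) = a(i) * b(i)
    end do

    if (mod(npad, vl) /= 0) error stop 'padded size is not a multiple of vl'
    if (any(c(1:n) /= 6.0)) error stop 'unexpected result'
    print '(a,i0)', 'padded trip count: ', npad
end program pad_demo
```

Only the first n elements carry meaningful results, so any later code that consumes c should keep using the original bound n.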

Collect Trip Counts Data

The Survey Report lacks trip counts data that might generate more precise recommendations.

Disable Unrolling

The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling or decrease the unroll factor using a directive: !DIR$ NOUNROLL or !DIR$ UNROLL.

Disable automatic loop unrolling using !DIR$ NOUNROLL.

!DIR$ NOUNROLL
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo
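
To decrease the unroll factor rather than disable unrolling entirely, !DIR$ UNROLL accepts an explicit factor. The sketch below is a self-contained variant of the loop above; the factor of 2 and the array size are assumed values for illustration and should be tuned for the actual loop.

```fortran
program unroll_demo
    implicit none
    integer, parameter :: m = 100      ! hypothetical trip count
    real    :: a(m), b(m), c(m), d(m)
    integer :: i

    a = 1.0
    c = 2.0

    ! Limit unrolling to a factor of 2 (assumed value).
    !DIR$ UNROLL(2)
    do i = 1, m
        b(i) = a(i) + 1
        d(i) = c(i) + 1
    end do

    if (any(b /= 2.0) .or. any(d /= 3.0)) error stop 'unexpected result'
    print *, 'b(1) =', b(1), ' d(1) =', d(1)
end program unroll_demo
```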

Use A Smaller Vector Length

The compiler chose a vector length that might be larger than the trip count. To fix: Specify a smaller vector length using a directive: !$OMP SIMD SIMDLEN.

!$OMP SIMD SIMDLEN(4)
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

In Intel Compiler version 19.0 and higher, there is a new vector length clause that allows the compiler to choose the best vector length based on cost: !DIR$ VECTOR VECTORLENGTH (vl1, vl2, ..., vln) where vl is an integer power of 2.

!DIR$ VECTOR VECTORLENGTH(2, 4, 16)
do i = 1, m
    b(i) = a(i) + 1
    d(i) = c(i) + 1
enddo

Disable Dynamic Alignment

The compiler automatically peeled iterations from the vector loop into a scalar loop to align the vector loop with a particular memory reference; however, this optimization may not be ideal. To possibly achieve better performance, disable automatic peel generation using the directive: !DIR$ VECTOR NODYNAMIC_ALIGN.

...
!DIR$ VECTOR NODYNAMIC_ALIGN
do i = 1, len
    a(i) = b(i) * c(i)
enddo

Serialized User Function Call(s) Present

User-defined functions in the loop body are not vectorized.

Enable Inline Expansion

Inlining of user-defined functions is disabled by a compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with 1 to enable inlining only where an inline keyword or attribute is specified, or with 2 to enable inlining of any function at the compiler's discretion.

Windows* OS: /Ob1 or /Ob2

Linux* OS: -inline-level=1 or -inline-level=2

Vectorize Serialized Function(s) Inside Loop

  • Enforce vectorization of the source loop by means of SIMD instructions and/or create a SIMD version of the function(s) using a directive:
    Target                                      Directive
    Source loop                                 !$OMP SIMD
    Inner function definition or declaration    !$OMP DECLARE SIMD
  • If using the Ob or inline-level compiler option to control inline expansion with the 1 argument, use an inline keyword to enable inlining or replace the 1 argument with 2 to enable inlining of any function at compiler discretion.

real function f (x)
    !$OMP DECLARE SIMD(f)
    real, intent(in), value  :: x
    f= x + 1
end function f

!DI