Vectorization and Array Contiguity with the Intel® Fortran Compiler

ID 689481
Updated 8/10/2018
Version Latest
Public


Subroutine dummy arguments can be pointers or assumed shape arrays, e.g.:

SUBROUTINE SUB(X, Y)
    REAL, DIMENSION(:)          :: X  ! assumed shape array
    REAL, DIMENSION(:), POINTER :: Y  ! pointer

This avoids the need to pass parameters such as array bounds explicitly. The Fortran standard allows the actual arguments to be non-contiguous array sections or pointers, e.g.:

    CALL SUB(A(1:100:10))   ! non-unit stride
    CALL SUB(B(2:4,:))      ! incomplete columns

Therefore, the compiler cannot assume that consecutive elements of X and Y are adjacent in memory and so cannot blindly issue efficient vector loads for several elements at once.
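To make this concrete, here is a minimal, self-contained sketch (the names M, SCALE_ME and DEMO are illustrative, not from a real code base). The dummy argument is assumed shape, so the subroutine must cope with whatever stride the caller's section carries in its descriptor:

  module m
  contains
    subroutine scale_me(x)
      real, dimension(:) :: x    ! assumed shape: stride travels in the descriptor
      x = 2.0 * x                ! consecutive X(i) need not be adjacent in memory
    end subroutine scale_me
  end module m

  program demo
    use m
    real :: a(100)
    a = 1.0
    call scale_me(a(1:100:10))   ! legal: a non-contiguous, stride-10 section
    print *, a(1), a(2), a(11)   ! prints 2.0, 1.0, 2.0
  end program demo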

If you know that such dummy arguments will always be contiguous in memory, you can use the CONTIGUOUS keyword to tell the compiler and it will generate more efficient code, e.g.:

  REAL, DIMENSION(:), CONTIGUOUS          :: X  ! assumed shape array
  REAL, DIMENSION(:), CONTIGUOUS, POINTER :: Y  ! pointer

However, the calling routine also needs to know that the arrays are contiguous (see /content/www/cn/zh/develop/videos/effective-parallel-optimizations-with-intel-fortran.html). When multiple routines are involved, it may be simpler to use a command line switch to tell the compiler that assumed shape arrays and/or pointers are always contiguous. The version 18 compiler is the first to support new options:

   -assume contiguous_assumed_shape  and  -assume contiguous_pointer  (Linux*)
   /assume:contiguous_assumed_shape  and  /assume:contiguous_pointer  (Windows*)

These will cause the compiler to assume that all such objects are contiguous in memory.
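On the source side, the caller can make the same guarantee explicit. A brief sketch, assuming the CONTIGUOUS version of SUB above is visible through an explicit interface (T and Y are illustrative names); the key point is that a CONTIGUOUS pointer may only be associated with a contiguous target, so the guarantee holds end to end:

  real, target              :: t(1000)
  real, pointer, contiguous :: y(:)

  y => t                  ! valid: the whole of T is contiguous
! y => t(1:1000:2)        ! invalid: a CONTIGUOUS pointer must not be
                          ! associated with a non-contiguous target
  call sub(t(1:500), y)   ! both actual arguments are contiguous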

In some cases where contiguity is unknown, version 18 and newer compilers may generate alternative code versions for the contiguous and non-contiguous cases and check the stride at run-time to determine which version to execute.
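You can mimic that two-version dispatch by hand with the Fortran 2008 IS_CONTIGUOUS intrinsic. The sketch below is only an illustration of the idea, not the compiler's actual mechanism; it funnels the contiguous case through local CONTIGUOUS pointers so that the fast path is visible to the optimizer:

  subroutine sub2(a, b)
    real, pointer, dimension(:)             :: a, b
    real, pointer, contiguous, dimension(:) :: ac, bc
    integer :: i

    if (is_contiguous(a) .and. is_contiguous(b)) then
       ac => a                  ! legal only because of the check above
       bc => b
       do i = 1, size(ac)       ! unit-stride loads and stores possible here
          ac(i) = log(bc(i))
       end do
    else
       do i = 1, size(a)        ! general path: stride unknown
          a(i) = log(b(i))
       end do
    end if
  end subroutine sub2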

Consider the following example (shown for Linux but applicable to other OS):

  subroutine sub(a,b)
    real, pointer, dimension(:) :: a,b
    integer :: i,n
    
    n = size(a,1)
!$OMP SIMD   
    do i=1,n
       a(i) = log(b(i)) 
    enddo
  end subroutine sub

ifort -c -qopt-report=3 -qopt-report-file=stderr sub.f90

LOOP BEGIN at sub.f90(7,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between b..(0) (8:15) and a(i) (8:8)
LOOP END
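The assumed dependence is not paranoia: a caller is free to overlap the two pointer targets. For example (BUF, P and Q are hypothetical names):

  real, target  :: buf(101)
  real, pointer :: p(:), q(:)

  p => buf(2:101)
  q => buf(1:100)
  call sub(p, q)   ! a(i) = log(b(i)) writes BUF(i+1), which iteration
                   ! i+1 then reads as b(i+1): a loop-carried dependence

Note that once the loop is vectorized under !$OMP SIMD (below), a call like this would violate the programmer's no-overlap assertion and produce wrong results.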

When compiled without OpenMP, the loop is not vectorized because the compiler must assume that the pointers A and B might alias each other (the data they point to might overlap). This can be overcome by activating the OpenMP SIMD directive with -qopenmp-simd, which tells the compiler it can assume there is no overlap and no dependency:

ifort -c -qopenmp-simd -qopt-report=3 -qopt-report-file=stderr sub.f90

LOOP BEGIN at sub.f90(7,5)
   remark #15328: vectorization support: non-unit strided load was emulated for the variable <b(i)>, stride is unknown to compiler   [ sub.f90(8,19) ]
   remark #15329: vectorization support: non-unit strided store was emulated for the variable <a(i)>, stride is unknown to compiler   [ sub.f90(8,8) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.007
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15452: unmasked strided loads: 1
   remark #15453: unmasked strided stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 106
   remark #15477: vector cost: 35.500
   remark #15478: estimated potential speedup: 2.980
   remark #15482: vectorized math library calls: 1
   remark #15488: --- end vector cost summary ---
LOOP END

The loop has been vectorized successfully, with an estimated speed-up, but the compiler had to generate non-unit-strided loads and stores because it did not know whether A and B were contiguous.

If we assert to the compiler that the pointer arguments are contiguous:

ifort -c -qopenmp-simd -assume contiguous_pointer -qopt-report=4  -qopt-report-file=stderr sub.f90

LOOP BEGIN at sub.f90(7,5)
   remark #15389: vectorization support: reference b(i) has unaligned access   [ sub.f90(8,19) ]
   remark #15388: vectorization support: reference a(i) has aligned access   [ sub.f90(8,8) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.179
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 106
   remark #15477: vector cost: 19.500
   remark #15478: estimated potential speedup: 5.120
   remark #15482: vectorized math library calls: 1
   remark #15488: --- end vector cost summary ---
LOOP END

The compiler is able to vectorize the loop using unit stride loads and stores and the estimated speed-up increases accordingly. (Note that this is only an estimate based on what is known at compile time; the actual speed-up is influenced by many factors, such as data location and alignment, and can be substantially different). Using the CONTIGUOUS keyword instead of the command line switch would have the same effect.
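For completeness, the equivalent source-level change is just one attribute on the declaration (everything else in the subroutine is unchanged):

  subroutine sub(a,b)
    real, pointer, contiguous, dimension(:) :: a,b
    integer :: i,n

    n = size(a,1)
!$OMP SIMD
    do i=1,n
       a(i) = log(b(i))
    enddo
  end subroutine sub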

In conclusion, if you know that pointer arrays or assumed shape dummy arguments will always correspond to contiguous memory, you can help the compiler vectorize more efficiently by telling it so. Use either the CONTIGUOUS keyword or the command line switches -assume contiguous_assumed_shape (/assume:contiguous_assumed_shape) and -assume contiguous_pointer (/assume:contiguous_pointer), which are new in version 18 and later compilers.

"