Auto-Vectorization of DO CONCURRENT with Multiple Indices

Published: 07/04/2018  

Last Updated: 07/18/2018

Let's consider the following snippet of code:

SUBROUTINE do_concurrent(n, arr1, arr2)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  REAL :: arr1(n,n,n), arr2(n,n,n)
  INTEGER :: i, j, k
  DO CONCURRENT (i=1:n,j=1:n,k=1:n)
     arr1(i,j,k) = arr1(i,j,k) + arr2(i,j,k)
  END DO
END SUBROUTINE do_concurrent

and compile it as follows:

ifort -xCORE-AVX2 -qopt-report=5 -qopt-report-file=stdout -c do_concurrent.f90 -o doconcurrent.o

The compiler optimization report shows that the loop is not vectorized:

...
 LOOP BEGIN at do_concurrent.f90(6,3)
         remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
         remark #15329: vectorization support: non-unit strided store was emulated for the variable <arr1(i,j,k)>, stride is unknown to compiler   [ do_concurrent.f90(7,6) ]
...

The Fortran standard allows the iterations of a DO CONCURRENT construct to be executed in any order, and the index variables and their ranges may be specified in any order. In the Intel® Fortran implementation, however, the order of the index variables matters: the DO CONCURRENT construct is interpreted as a nest of loops in the order given in the header, from outermost to innermost.

In the example above, the innermost loop therefore runs over k, the last array subscript. Because Fortran stores arrays in column-major order, consecutive iterations over k touch elements that are n*n apart in memory. The memory accesses are not contiguous, so the compiler concludes that vectorization would not be worthwhile ("seems inefficient").
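To make this interpretation concrete, the following is a minimal sketch of the explicit loop nest that the header (i=1:n,j=1:n,k=1:n) corresponds to under the behavior described above. The subroutine name nested_ijk is hypothetical, and the code only illustrates the nesting order; it is not output produced by the compiler.

```fortran
! Hypothetical illustration of the nesting order described above.
SUBROUTINE nested_ijk(n, arr1, arr2)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  REAL, INTENT(INOUT) :: arr1(n,n,n)
  REAL, INTENT(IN)    :: arr2(n,n,n)
  INTEGER :: i, j, k
  DO i = 1, n              ! first index in the header: outermost loop
    DO j = 1, n
      DO k = 1, n          ! last index in the header: innermost loop
        ! Column-major storage: each step in k moves n*n elements in memory
        arr1(i,j,k) = arr1(i,j,k) + arr2(i,j,k)
      END DO
    END DO
  END DO
END SUBROUTINE nested_ijk
```

Written this way, it is easier to see that the innermost loop walks the slowest-varying subscript, which is the access pattern the optimization report flags as non-unit stride.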

The workaround is to reverse the order of the index variables in the DO CONCURRENT header:

SUBROUTINE do_concurrent(n, arr1, arr2)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  REAL :: arr1(n,n,n), arr2(n,n,n)
  INTEGER :: i, j, k
  DO CONCURRENT (k=1:n,j=1:n,i=1:n)
     arr1(i,j,k) = arr1(i,j,k) + arr2(i,j,k)
  END DO
END SUBROUTINE do_concurrent

This ordering matches the column-major memory layout of Fortran arrays. The innermost loop, now over i, is auto-vectorized with unit stride, as the compiler optimization report shows:

...
LOOP BEGIN at do_concurrent.f90(6,3)
...
  remark #15388: vectorization support: reference arr1(i,j,k) has aligned access   [ do_concurrent.f90(7,6) ]
...
  remark #15300: LOOP WAS VECTORIZED
...
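
Since the iterations of a DO CONCURRENT construct are independent, reordering the header should change only the traversal order, not the result. One way to confirm this is a small check program such as the sketch below. The program name, the internal procedure names add_ijk and add_kji, and the array size are illustrative assumptions and not part of the original example.

```fortran
! Hypothetical check program; all names and the size n are assumptions.
PROGRAM check_orderings
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 32
  REAL, ALLOCATABLE :: a_ijk(:,:,:), a_kji(:,:,:), b(:,:,:)

  ALLOCATE(a_ijk(n,n,n), a_kji(n,n,n), b(n,n,n))
  CALL RANDOM_NUMBER(b)
  a_ijk = 1.0
  a_kji = 1.0

  CALL add_ijk(n, a_ijk, b)   ! header order from the first listing
  CALL add_kji(n, a_kji, b)   ! header order from the workaround

  ! Every element is produced by the same single addition, so the two
  ! results are expected to agree element by element.
  IF (ANY(a_ijk /= a_kji)) ERROR STOP 'results differ'
  PRINT *, 'results match'

CONTAINS

  SUBROUTINE add_ijk(m, arr1, arr2)
    INTEGER, INTENT(IN) :: m
    REAL, INTENT(INOUT) :: arr1(m,m,m)
    REAL, INTENT(IN)    :: arr2(m,m,m)
    INTEGER :: i, j, k
    DO CONCURRENT (i=1:m, j=1:m, k=1:m)
      arr1(i,j,k) = arr1(i,j,k) + arr2(i,j,k)
    END DO
  END SUBROUTINE add_ijk

  SUBROUTINE add_kji(m, arr1, arr2)
    INTEGER, INTENT(IN) :: m
    REAL, INTENT(INOUT) :: arr1(m,m,m)
    REAL, INTENT(IN)    :: arr2(m,m,m)
    INTEGER :: i, j, k
    DO CONCURRENT (k=1:m, j=1:m, i=1:m)
      arr1(i,j,k) = arr1(i,j,k) + arr2(i,j,k)
    END DO
  END SUBROUTINE add_kji

END PROGRAM check_orderings
```

Compiling this program with the same -qopt-report options used earlier should produce a separate report entry for each of the two loops, which makes it possible to compare the vectorization remarks side by side.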
