Let's consider the following snippet of code:
We then compile it as follows:
The compiler optimization report shows that the loop is not vectorized:
The Fortran standard allows the iterations of a DO CONCURRENT construct to be executed in any order, and the index variables with their associated ranges may be listed in any order in the header. However, the Intel® Fortran implementation treats the order of the index variables as significant: it interprets a DO CONCURRENT construct as a set of nested loops, from outermost to innermost, in the order the indices appear in the DO CONCURRENT statement. In the above example, the loop is not vectorized because the innermost loop runs over k, so successive iterations do not access contiguous memory, and the compiler concludes that vectorization would not be worthwhile (the report calls it "inefficient").
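To make this interpretation concrete, here is a minimal sketch. It is not the original snippet; it assumes hypothetical arrays `a`, `b` and `c` of shape `(n,n,n)`, indexed as `(i,j,k)`, with the index variables listed `i, j, k` in the header. Under the nesting rule described above, the DO CONCURRENT construct behaves like the explicit loop nest that follows it, with `k` innermost. Because Fortran stores arrays in column-major order, each step in `k` jumps `n*n` elements through memory.

```fortran
program order_demo
  implicit none
  integer, parameter :: n = 32          ! hypothetical problem size
  real :: a(n,n,n), b(n,n,n), c(n,n,n)
  integer :: i, j, k

  b = 1.0
  c = 2.0

  ! Index variables listed i, j, k: interpreted as i outermost and
  ! k innermost, so the innermost loop strides n*n elements through
  ! a, b and c (the last subscript varies slowest in memory).
  do concurrent (i = 1:n, j = 1:n, k = 1:n)
     a(i,j,k) = b(i,j,k) + c(i,j,k)
  end do

  ! The equivalent explicit loop nest under that interpretation.
  do i = 1, n
     do j = 1, n
        do k = 1, n
           a(i,j,k) = b(i,j,k) + c(i,j,k)
        end do
     end do
  end do

  print *, sum(a)
end program order_demo
```

Both forms compute the same values. The only difference the header order makes is which loop the compiler places innermost.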
The workaround is to reorder the index variables in the DO CONCURRENT header as follows:
which matches the column-major memory layout of Fortran arrays, where the first subscript varies fastest. The innermost loop over i is then auto-vectorized with unit stride, as the compiler optimization report shows:
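Continuing the same hypothetical example (arrays `a`, `b`, `c` of shape `(n,n,n)`, not the original snippet), the sketch below lists the index variables as `k, j, i`. Under the Intel nesting rule, `i` now becomes the innermost loop, so consecutive iterations touch adjacent elements of each array.

```fortran
program order_fixed
  implicit none
  integer, parameter :: n = 32          ! hypothetical problem size
  real :: a(n,n,n), b(n,n,n), c(n,n,n)
  integer :: i, j, k

  b = 1.0
  c = 2.0

  ! Index variables listed k, j, i: k is outermost and i innermost,
  ! so the innermost loop walks the first subscript with unit stride,
  ! matching Fortran's column-major storage.
  do concurrent (k = 1:n, j = 1:n, i = 1:n)
     a(i,j,k) = b(i,j,k) + c(i,j,k)
  end do

  print *, sum(a)
end program order_fixed
```

The semantics are unchanged, since the standard permits any iteration order. Only the loop the compiler treats as innermost, and therefore the memory access pattern it vectorizes, is different.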