The previous examples made use of double precision arrays. They may be built instead with single precision arrays by
adding the macro,
. The non-vectorized versions of the loop execute only slightly faster than the double precision versions; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register processes four single precision data elements at once instead of two double precision data elements.
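The lane arithmetic behind that speedup can be sketched in a few lines of C. This is only an illustration, not code from the tutorial: `VEC_BYTES` and `lanes_per_register` are hypothetical names, and the sketch assumes the 16-byte vector register width mentioned above and the usual 4-byte `float` and 8-byte `double`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed width of one vector register in bytes (hypothetical name). */
#define VEC_BYTES 16

/* Number of elements of the given size that one packed SIMD
   instruction can process at once. Hypothetical helper; it requires
   that elem_size divides the register width evenly. */
static size_t lanes_per_register(size_t elem_size) {
    assert(elem_size > 0 && VEC_BYTES % elem_size == 0);
    return VEC_BYTES / elem_size;
}
```

Calling `lanes_per_register(sizeof(float))` and `lanes_per_register(sizeof(double))` gives the two lane counts being compared. The ratio between them is an upper bound on the extra speedup from switching precision, not a guaranteed gain.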
In the example with data alignment, you will need to set
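to ensure 16-byte alignment for each row of the matrix. Otherwise,
#pragma vector aligned
will cause the program to fail, because that pragma tells the compiler to use aligned loads and stores, which fault on misaligned addresses.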
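The relationship between row length, element size, and row padding can be sketched as follows. This is a minimal illustration under stated assumptions, not the tutorial's own code: `ALIGN_BYTES`, `row_padding`, and `is_aligned` are hypothetical names, and the sketch assumes the matrix is stored row by row with its first element already on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Required alignment of each row start in bytes (hypothetical name). */
#define ALIGN_BYTES 16

/* Smallest number of padding elements to append to each row so that
   the row stride, (cols + padding) * elem_size, is a multiple of
   ALIGN_BYTES. Hypothetical helper; requires that elem_size divides
   ALIGN_BYTES evenly. */
static size_t row_padding(size_t cols, size_t elem_size) {
    assert(elem_size > 0 && ALIGN_BYTES % elem_size == 0);
    size_t per_block = ALIGN_BYTES / elem_size; /* elements per 16 bytes */
    size_t rem = cols % per_block;              /* elements past the last full block */
    return rem == 0 ? 0 : per_block - rem;
}

/* Returns 1 if the address is a multiple of ALIGN_BYTES, else 0. */
static int is_aligned(const void *p) {
    return ((uintptr_t)p % ALIGN_BYTES) == 0;
}
```

For a hypothetical row length of 101 elements, `row_padding(101, 8)` and `row_padding(101, 4)` return different values, which is why a padding chosen for double precision rows does not necessarily keep single precision rows aligned. `is_aligned` can be used to check each row's starting address before relying on an alignment pragma.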
This completes the tutorial that shows how the compiler can optimize performance with various vectorization techniques.