Tutorial

  • 04/11/2022
  • Public Content

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary, so that the vectorizer can use aligned load instructions for all arrays instead of the slower unaligned load instructions, and can avoid runtime tests of alignment. Defining the ALIGNED macro modifies the declarations of a, b, and x in Driver.c to use the aligned attribute, which has the following syntax:
float array[30] __attribute__((aligned(base, [offset])));
This instructs the compiler to create an array that is aligned on a "base"-byte boundary, with an "offset" (default 0) in bytes from that boundary. Example:
FTYPE a[ROW][COLWIDTH] __attribute__((aligned(16)));
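The effect of the attribute can be checked at run time by testing the array's address. The following is a minimal sketch, not the code from Driver.c: the sizes ROW and COLWIDTH, the choice of double for FTYPE, and the helper names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ROW      4      /* hypothetical sizes, for illustration only */
#define COLWIDTH 8

typedef double FTYPE;   /* assumption: the tutorial's FTYPE may differ */

/* Request 16-byte alignment for the start of the array. */
static FTYPE a[ROW][COLWIDTH] __attribute__((aligned(16)));

/* Returns 1 if p lies on an `alignment`-byte boundary, else 0. */
static int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Returns 1 if the array `a` starts on a 16-byte boundary. */
int a_is_16_byte_aligned(void)
{
    return is_aligned(a, 16);
}
```

Note that the attribute only guarantees the alignment of the first element; whether later rows are aligned depends on the row length.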
In addition, the row length of the matrix a needs to be padded out to a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer that it can safely assume the arrays in Multiply.c are aligned, by using #pragma vector aligned.
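The padding arithmetic can be sketched as a rounding-up of the column count. The function name padded_cols and its parameters are hypothetical and are not taken from Driver.c; the sketch assumes the alignment is a whole multiple of the element size.

```c
#include <assert.h>
#include <stddef.h>

/* Round `cols` up so that one row (cols * elem_size bytes) is a
 * multiple of `alignment` bytes. Assumes alignment % elem_size == 0. */
size_t padded_cols(size_t cols, size_t elem_size, size_t alignment)
{
    assert(elem_size != 0 && alignment % elem_size == 0);
    size_t per_block = alignment / elem_size;  /* elements per aligned block */
    return ((cols + per_block - 1) / per_block) * per_block;
}
```

For example, with 8-byte elements and 16-byte alignment, a row of 101 elements would be padded to 102 elements (816 bytes), while a row of 102 elements needs no padding.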
If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.
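As a rough illustration of guarding the pragma with a macro, the sketch below shows a matrix-vector product whose inner loop carries the hint only when ALIGNED is defined. It is not the code from Multiply.c: the function name matvec_sketch, its signature, and the flat row-major layout are assumptions. Compilers that do not recognize the pragma typically ignore it, possibly with a warning.

```c
#include <assert.h>

/* Hypothetical kernel: b = A * x, where consecutive rows of A are stored
 * `colwidth` elements apart (colwidth >= cols leaves room for padding). */
void matvec_sketch(int rows, int cols, int colwidth,
                   const double *a, double *b, const double *x)
{
    assert(colwidth >= cols);
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
#ifdef ALIGNED
        /* Promise to the vectorizer that every array reference in the
         * next loop is aligned. Only safe if a, x, and each row of a
         * really are aligned as described in the text. */
#pragma vector aligned
#endif
        for (int j = 0; j < cols; j++)
            sum += a[i * colwidth + j] * x[j];
        b[i] = sum;
    }
}
```

Keeping the pragma behind the same macro that aligns the data means the promise and the declarations that make it true are switched on together.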
If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.
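One possible way to select the boundary at compile time is to key it off the predefined __AVX__ macro, which compilers such as icc, GCC, and Clang define when AVX code generation is enabled. The macro name DATA_ALIGN and the helper functions below are hypothetical; this is a sketch, not part of the tutorial sources.

```c
#include <assert.h>
#include <stdint.h>

/* Choose 32-byte alignment when compiling for AVX, 16-byte otherwise. */
#if defined(__AVX__)
#define DATA_ALIGN 32
#else
#define DATA_ALIGN 16
#endif

/* Hypothetical vector declared on the selected boundary. */
static double x[8] __attribute__((aligned(DATA_ALIGN)));

/* Returns the alignment, in bytes, chosen for this build. */
int data_alignment(void)
{
    return DATA_ALIGN;
}

/* Returns 1 if x starts on a DATA_ALIGN-byte boundary, else 0. */
int x_is_aligned(void)
{
    assert(DATA_ALIGN != 0);
    return ((uintptr_t)x % DATA_ALIGN) == 0;
}
```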
Recompile the program after adding the ALIGNED macro to ensure consistently aligned data. Use -qopt-report=4 to see the change in aligned references.
icc -std=c99 -qopt-report=4 -qopt-report-phase=vec -D NOALIAS -D ALIGNED Multiply.c Driver.c -o MatVector
Multiply.optrpt before adding the #pragma vector aligned shows:
LOOP BEGIN at Multiply.c(49,9)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at Multiply.c(49,9)
   remark #15388: vectorization support: reference a[i][j] has aligned access   [ Multiply.c(50,21) ]
   remark #15388: vectorization support: reference x[j] has aligned access   [ Multiply.c(50,31) ]
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 1.031
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 10
   remark #15477: vector cost: 4.000
   remark #15478: estimated potential speedup: 2.380
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Multiply.c(49,9)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at Multiply.c(49,9)
<Remainder loop for vectorization>
LOOP END
And after adding -D ALIGNED:
LOOP BEGIN at Multiply.c(49,9)
   remark #15388: vectorization support: reference a[i][j] has aligned access   [ Multiply.c(50,21) ]
   remark #15388: vectorization support: reference x[j] has aligned access   [ Multiply.c(50,31) ]
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 0.594
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 10
   remark #15477: vector cost: 4.000
   remark #15478: estimated potential speedup: 2.410
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Multiply.c(49,9)
<Remainder loop for vectorization>
   remark #15388: vectorization support: reference a[i][j] has aligned access   [ Multiply.c(50,21) ]
   remark #15388: vectorization support: reference x[j] has aligned access   [ Multiply.c(50,31) ]
   remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 2.417
LOOP END
Your line and column numbers may be different.
Now, run the executable and record the execution time.
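If you want to time a kernel from inside a C program, one simple approach is the standard clock() function. The sketch below is an assumption-laden illustration, not the timing code used by the tutorial's driver: time_calls and sample_work are hypothetical names, and clock() reports CPU time, which can differ from wall-clock time.

```c
#include <time.h>

/* Placeholder workload; stands in for the kernel being measured. */
static volatile double sink;

void sample_work(void)
{
    double s = 0.0;
    for (int i = 0; i < 1000; i++)
        s += i * 0.5;
    sink = s;  /* volatile store keeps the loop from being optimized away */
}

/* Calls fn `reps` times and returns the elapsed CPU time in seconds. */
double time_calls(void (*fn)(void), int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Repeating the call many times and comparing runs built with and without -D ALIGNED gives a more stable comparison than a single short run.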

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.