Tutorial

  • 04/11/2022
  • Public Content

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary, so that the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions, and can avoid runtime tests of alignment. Defining the ALIGNED macro modifies the declarations of a, b, and x in Driver.c to use __declspec(align), which has the following syntax:
__declspec(align(16)) float array[30];
This instructs the compiler to create an array that is aligned on a "base"-byte boundary with an "offset" (default 0) in bytes from that boundary. In the declaration above, the base is 16 and the offset is omitted. Example:
__declspec(align(16)) FTYPE a[ROW][COLWIDTH];
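The effect of such a declaration can be checked at run time by testing the address of the array and of each row. The following is a minimal sketch, not the tutorial's Driver.c: the values of ROW and COLWIDTH, the choice of float for FTYPE, the ALIGN16 macro, and the helper functions are assumptions made for illustration. Where __declspec is not available, the sketch falls back to the standard C11 alignas specifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define ROW 4
#define COLWIDTH 8          /* 8 floats = 32 bytes per row, a multiple of 16 */
typedef float FTYPE;

/* Hypothetical macro: use __declspec(align(16)) with Microsoft-compatible
   compilers, otherwise the standard C11 alignas(16). */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#include <stdalign.h>
#define ALIGN16 alignas(16)
#endif

ALIGN16 FTYPE a[ROW][COLWIDTH];

/* Returns 1 if p lies on a multiple of `boundary` bytes, otherwise 0. */
int is_aligned(const void *p, size_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}

/* Returns 1 if every row of a starts on a 16-byte boundary, otherwise 0. */
int rows_are_aligned(void)
{
    for (size_t i = 0; i < ROW; i++) {
        if (!is_aligned(&a[i][0], 16)) {
            return 0;
        }
    }
    return 1;
}

/* Aborts in debug builds if p is not aligned on `boundary` bytes. */
void require_aligned(const void *p, size_t boundary)
{
    assert(is_aligned(p, boundary));
}
```

Because each row here occupies 32 bytes, aligning the start of the array is enough to align every row.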
In addition, the row length of the matrix a needs to be padded out to a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer that it can safely assume the arrays in Multiply.c are aligned, by using #pragma vector aligned.
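The padding arithmetic can be sketched as rounding the row size in bytes up to the next multiple of the alignment and converting back to an element count. The function name padded_cols and the sample sizes in the usage note are hypothetical; the tutorial's own sources may compute the padded width differently.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smallest column count >= cols whose row size in bytes is a
   multiple of align_bytes. Assumes align_bytes is a multiple of elem_size,
   which holds for 4-byte float or 8-byte double with 16-byte alignment. */
size_t padded_cols(size_t cols, size_t elem_size, size_t align_bytes)
{
    assert(elem_size != 0 && align_bytes % elem_size == 0);

    size_t row_bytes = cols * elem_size;
    /* Round up to the next multiple of align_bytes. */
    size_t padded_bytes = (row_bytes + align_bytes - 1) / align_bytes * align_bytes;
    return padded_bytes / elem_size;
}
```

For example, a row of 101 doubles (808 bytes) is padded to 102 columns (816 bytes), and a row of 101 floats (404 bytes) is padded to 104 columns (416 bytes).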
If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.
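As an illustration of where the pragma goes, the sketch below applies it to the inner loop of a small matrix-vector product. This is not the code from Multiply.c: the function names matvec and run_demo, the array sizes, the use of double, and the ALIGN16 macro are assumptions. The pragma is guarded so that only Intel compilers see it, and a debug assertion checks the alignment promise before the loop relies on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: __declspec(align(16)) or the C11 equivalent. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#include <stdalign.h>
#define ALIGN16 alignas(16)
#endif

/* Assumed sizes for illustration only. */
#define ROW 3
#define COLWIDTH 4          /* 4 doubles = 32 bytes per row, a multiple of 16 */

ALIGN16 double a[ROW][COLWIDTH];
ALIGN16 double b[ROW];
ALIGN16 double x[COLWIDTH];

/* Computes out = m * v. The caller must pass 16-byte aligned m and v, with
   each row of m padded to a multiple of 16 bytes. */
void matvec(size_t rows, size_t cols, double m[][COLWIDTH],
            double out[], const double v[])
{
    /* The pragma is a promise to the compiler, so verify it in debug builds. */
    assert((uintptr_t)m % 16 == 0 && (uintptr_t)v % 16 == 0);

    for (size_t i = 0; i < rows; i++) {
        double sum = 0.0;
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#pragma vector aligned
#endif
        for (size_t j = 0; j < cols; j++) {
            sum += m[i][j] * v[j];
        }
        out[i] = sum;
    }
}

/* Fills a with 1..12 and x with ones, then stores a * x in b. */
void run_demo(void)
{
    for (size_t i = 0; i < ROW; i++) {
        for (size_t j = 0; j < COLWIDTH; j++) {
            a[i][j] = (double)(i * COLWIDTH + j + 1);
        }
    }
    for (size_t j = 0; j < COLWIDTH; j++) {
        x[j] = 1.0;
    }
    matvec(ROW, COLWIDTH, a, b, x);
}
```

After run_demo(), b holds the row sums 10, 26, and 42. With other compilers the guarded pragma is skipped and the loop is compiled as ordinary C.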
If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.
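One way to keep a single source file for both targets is to select the alignment at compile time. The sketch below assumes that the compiler defines the __AVX__ macro when AVX code generation is enabled, as GCC, Clang, and Microsoft-compatible compilers do. The names DATA_ALIGN and ALIGNED_DATA, the array size, and the helper functions are hypothetical and do not come from the tutorial sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Use 32-byte alignment for AVX builds and 16-byte alignment otherwise. */
#if defined(__AVX__)
#define DATA_ALIGN 32
#else
#define DATA_ALIGN 16
#endif

/* Hypothetical macro wrapping the compiler-specific alignment syntax. */
#if defined(_MSC_VER)
#define ALIGNED_DATA __declspec(align(DATA_ALIGN))
#else
#include <stdalign.h>
#define ALIGNED_DATA alignas(DATA_ALIGN)
#endif

#define COLWIDTH 8          /* assumed size: 8 doubles = 64 bytes */

ALIGNED_DATA double x[COLWIDTH];

/* Returns the alignment, in bytes, selected for this build. */
size_t data_alignment(void)
{
    return DATA_ALIGN;
}

/* Returns 1 if x starts on a DATA_ALIGN-byte boundary, otherwise 0. */
int x_is_aligned(void)
{
    assert(DATA_ALIGN == 16 || DATA_ALIGN == 32);
    return ((uintptr_t)x % DATA_ALIGN) == 0;
}
```

Row padding has to follow the same choice: when DATA_ALIGN is 32, each row length must be a multiple of 32 bytes.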
Use /Qopt-report:4 to see the report reflect the updated references (in your project's property pages, select Configuration Properties > C/C++ > Diagnostics [Intel C++] and, for Optimization Diagnostics Level, select Level 4 (/Qopt-report:4)). Rebuild the program after adding the ALIGNED preprocessor definition to ensure consistently aligned data.
Multiply.optrpt before using #pragma vector aligned:
LOOP BEGIN at Multiply.c(49,9)
   Peeled loop for vectorization
LOOP END

LOOP BEGIN at Multiply.c(49,9)
   Multiply.c(50,13):remark #15388: vectorization support: reference a[i][j] has aligned access
   Multiply.c(50,13):remark #15388: vectorization support: reference x[j] has aligned access
   Multiply.c(49,9):remark #15305: vectorization support: vector length 2
   Multiply.c(49,9):remark #15399: vectorization support: unroll factor set to 4
   Multiply.c(49,9):remark #15309: vectorization support: normalized vectorization overhead 1.031
   Multiply.c(49,9):remark #15300: LOOP WAS VECTORIZED
   Multiply.c(49,9):remark #15442: entire loop may be executed in remainder
   Multiply.c(49,9):remark #15448: unmasked aligned unit stride loads: 2
   Multiply.c(49,9):remark #15475: --- begin vector cost summary ---
   Multiply.c(49,9):remark #15476: scalar cost: 10
   Multiply.c(49,9):remark #15477: vector cost: 4.000
   Multiply.c(49,9):remark #15478: estimated potential speedup: 2.380
   Multiply.c(49,9):remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Multiply.c(49,9)
   Alternate Alignment Vectorized Loop
LOOP END

LOOP BEGIN at Multiply.c(49,9)
   Remainder loop for vectorization
LOOP END
Multiply.optrpt after adding ALIGNED to the preprocessor definitions:
LOOP BEGIN Multiply.c(49,9)
   Multiply.c(50,13):remark #15388: vectorization support: reference a[i][j] has aligned access
   Multiply.c(50,13):remark #15388: vectorization support: reference x[j] has aligned access
   Multiply.c(49,9):remark #15305: vectorization support: vector length 2
   Multiply.c(49,9):remark #15399: vectorization support: unroll factor set to 4
   Multiply.c(49,9):remark #15309: vectorization support: normalized vectorization overhead 0.594
   Multiply.c(49,9):remark #15300: LOOP WAS VECTORIZED
   Multiply.c(49,9):remark #15448: unmasked aligned unit stride loads: 2
   Multiply.c(49,9):remark #15475: --- begin vector cost summary ---
   Multiply.c(49,9):remark #15476: scalar cost: 10
   Multiply.c(49,9):remark #15477: vector cost: 4.000
   Multiply.c(49,9):remark #15478: estimated potential speedup: 2.410
   Multiply.c(49,9):remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Multiply.c(49,9)
   Remainder loop for vectorization
   Multiply.c(50,13):remark #15388: vectorization support: reference a[i][j] has aligned access
   Multiply.c(50,13):remark #15388: vectorization support: reference x[j] has aligned access
   Multiply.c(49,9):remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
   Multiply.c(49,9):remark #15305: vectorization support: vector length 2
   Multiply.c(49,9):remark #15309: vectorization support: normalized vectorization overhead 2.417
LOOP END
Your line and column numbers may be different.
Now, run the executable and record the execution time.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.