Utilizing Full Vectors and Use of Option -qopt-assume-safe-padding

Efficient vectorization involves making full use of the vector hardware. This implies that you should strive to have most of the execution occur in the main (kernel) vector loop rather than in the peel loop and/or remainder loop.

Remainder Loop:

A remainder loop executes the remaining iterations when the trip count (loop count) of the vector loop is not a multiple of the vector length. While this is unavoidable in many cases, spending a large amount of time in remainder loops leads to performance inefficiencies. For example, if the vector-loop trip count is 20 and the vector length is 16, the compiler generates a kernel loop that executes once (processing 16 elements, one full vector), and the remaining 4 iterations have to be executed in the remainder loop. Although the Intel compiler may vectorize the remainder loop, as reported by -qopt-report, it will not be as efficient as the kernel loop. For example, the remainder loop uses vector masks and may have to use gathers/scatters instead of unit-stride loads/stores because of memory-fault-protection issues. The best way to address this is to refactor the algorithm/code so that the remainder loop is not executed at runtime: make the trip count a multiple of the vector length and/or make the trip count large compared to the vector length so that the overhead of any execution in the remainder loop is comparatively low.
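
For example, here is a minimal sketch of the refactoring idea (the function name and padding scheme are illustrative, not from the original code): round the trip count up to a multiple of the vector length and allocate the arrays with enough extra elements that the padded iterations are harmless.

/* Sketch: assume a vector length of 16 floats and round the trip count
   up so that all work lands in the kernel vector loop.  The arrays must
   be allocated with at least round_up16(n) elements; values computed in
   the padded tail are simply ignored by the rest of the program. */
#define VLEN 16

static int round_up16(int n)
{
  return (n + VLEN - 1) / VLEN * VLEN;
}

void scale(float *a, const float *b, int n)
{
  int i;
  int n_padded = round_up16(n);
  for (i = 0; i < n_padded; i++) {
    a[i] = 2.0f * b[i];
  }
}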

The compiler's optimizations also take into account any knowledge of the actual trip-count value. If the trip count is 20, the compiler usually makes better decisions when it knows that the trip count is 20 (a constant known statically to the compiler) than when the trip count is a symbolic value n that merely happens to be 20 at runtime, for example an input value read from a file. In the latter case, you can help the compiler by placing a "#pragma loop_count (20)" in C/C++ or a "!DIR$ LOOP COUNT (20)" in Fortran before the loop, as sketched below.
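
For instance, a minimal sketch using the loop from the example later in this article (the function name is hypothetical; 20 is just the expected trip count):

void foo1_hinted(float *a, float *b, float *c, int n)
{
  int i;
/* Hint: n is symbolic here but is expected to be 20 at runtime. */
#pragma loop_count (20)
#pragma ivdep
  for (i = 0; i < n; i++) {
    a[i] *= b[i] + c[i];
  }
}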

Also take into account any unrolling of the vector loop done by the compiler, which you can see in the output of the "-qopt-report=5 -qopt-report-phase=vec" options. For example, if the compiler vectorizes a loop of trip count n with vector length 16 and unrolls the loop by 2 after vectorization, each kernel-loop iteration executes 32 iterations of the original source loop. If the dynamic trip count happens to be 20, the kernel loop is skipped completely and all execution happens in the remainder loop. If you encounter this issue, you can use "#pragma nounroll" in C/C++ or "!DIR$ NOUNROLL" in Fortran to turn off unrolling of the vector loop, or use the loop_count pragma described earlier to influence the compiler heuristics.
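
A minimal sketch of suppressing the post-vectorization unroll (hypothetical function name, same loop body as above):

void foo1_nounroll(float *a, float *b, float *c, int n)
{
  int i;
/* Keep the kernel at one vector length per iteration so that a small
   trip count (such as 20 with vector length 16) still reaches the kernel. */
#pragma nounroll
#pragma ivdep
  for (i = 0; i < n; i++) {
    a[i] *= b[i] + c[i];
  }
}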

If you want to disable vectorization of the compiler-generated remainder loop, place "#pragma vector novecremainder" in C/C++ or "!DIR$ VECTOR NOVECREMAINDER" in Fortran before the loop. Using this clause also disables vectorization of any peel loop the compiler generates for the loop.
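
In C, a minimal sketch looks like this (hypothetical function name):

void foo1_scalar_remainder(float *a, float *b, float *c, int n)
{
  int i;
/* Vectorize the loop, but keep any peel and remainder loops scalar. */
#pragma vector novecremainder
#pragma ivdep
  for (i = 0; i < n; i++) {
    a[i] *= b[i] + c[i];
  }
}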

Peel Loop:

The compiler typically generates a dynamic peel loop to align one of the memory accesses inside the loop. The peel loop peels off a few iterations of the original source loop until the candidate memory access becomes aligned; its trip count is guaranteed to be smaller than the vector length. This optimization lets the kernel vector loop use more aligned load/store instructions, which increases the kernel loop's efficiency. The peel loop itself, however, is less efficient even if the compiler vectorizes it. Study the "-qopt-report=5 -qopt-report-phase=vec" output from the compiler. The best way to address this is to refactor the algorithm/code so that the accesses are aligned and the compiler knows about the alignment, following the vectorizer alignment BKMs (best known methods). If the compiler knows that all accesses are aligned (for example, you correctly place "#pragma vector aligned" before the loop so that the compiler can safely assume all memory accesses inside the loop are aligned), then it generates no peel loop.
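
For example, a minimal sketch (hypothetical function name; it is the caller's responsibility that a, b, and c really are suitably aligned, for instance by allocating them with _mm_malloc(size, 64)):

void foo1_aligned(float *a, float *b, float *c, int n)
{
  int i;
/* Promise to the compiler: every memory access in this loop is aligned.
   Passing unaligned pointers here would be a correctness bug. */
#pragma vector aligned
#pragma ivdep
  for (i = 0; i < n; i++) {
    a[i] *= b[i] + c[i];
  }
}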

You can also use the loop_count pragma described earlier to influence the compiler's decision of whether or not to create a peel loop.

You can instruct the compiler NOT to generate a dynamic peel loop by adding "#pragma vector unaligned" in C/C++ or "!DIR$ VECTOR UNALIGNED" in Fortran before the loop in the source, as sketched below.
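
Conversely, a minimal sketch that suppresses the dynamic peel loop when alignment cannot be guaranteed (hypothetical function name):

void foo1_unaligned(float *a, float *b, float *c, int n)
{
  int i;
/* Use unaligned loads/stores throughout; no dynamic peel loop is generated. */
#pragma vector unaligned
#pragma ivdep
  for (i = 0; i < n; i++) {
    a[i] *= b[i] + c[i];
  }
}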

You can use the vector pragma/directive with the novecremainder clause (as mentioned above) to disable vectorization of the peel loop generated by the compiler. 

It may be undesirable to have a dynamic peel loop when the trip count of the loop is expected to be small. The compiler uses any knowledge of the actual trip count of the loop (static constant, loop_count pragma, etc.) before deciding to do dynamic peeling for alignment, but in many cases this information is not available to it. One way to provide the information is to add a "#pragma loop_count (20)" in C/C++ or a "!DIR$ LOOP COUNT (20)" in Fortran before the loop, with the appropriate trip-count value for your application.

Example:

% cat -n t2.c

           1  #include <stdio.h>
           2
           3  void foo1(float *a, float *b, float *c, int n)
           4  {
           5    int i;
           6  #pragma ivdep
           7    for (i=0; i<n; i++) {
           8      a[i] *= b[i] + c[i];
           9    }
          10  }
          11
          12  void foo2(float *a, float *b, float *c, int n)
          13  {
          14    int i;
          15  #pragma ivdep
          16    for (i=0; i<20; i++) {
          17      a[i] *= b[i] - c[i];
          18    }
          19  }

For the loop in foo1, the compiler generates a kernel vector loop, unrolled after vectorization by a factor of 2, along with a peel loop and remainder loops. In the report below, the peel loop is not vectorized, while the remainder loop is vectorized at a shorter vector length and is followed by a final scalar remainder.

For the loop in foo2, the compiler takes advantage of the fact that the trip count is a constant (20) and generates a kernel loop that is vectorized and unrolled by 2. The remainder loop (of 4 iterations) is completely unrolled by the compiler and vectorized, and no peel loop is generated.

Note that this optimization report is specific to Intel® AVX2 instructions; compiling for a different instruction set may produce a different vectorization report.

$ icc -O2 -qopt-report=5 -qopt-report-phase=vec -c -inline-level=0 -xcore-avx2 t2.c
LOOP BEGIN at t2.c(8,3)
<Peeled loop for vectorization>
LOOP END
LOOP BEGIN at t2.c(8,3)
   remark #15388: vectorization support: reference a[i] has aligned access   [ t2.c(9,5) ]
   remark #15388: vectorization support: reference a[i] has aligned access   [ t2.c(9,5) ]
   remark #15389: vectorization support: reference b[i] has unaligned access   [ t2.c(9,13) ]
   remark #15389: vectorization support: reference c[i] has unaligned access   [ t2.c(9,20) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 8
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 0.889
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 11
   remark #15477: vector cost: 1.120
   remark #15478: estimated potential speedup: 6.990
   remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at t2.c(8,3)
<Remainder loop for vectorization>
   remark #15389: vectorization support: reference a[i] has unaligned access   [ t2.c(9,5) ]
   remark #15389: vectorization support: reference a[i] has unaligned access   [ t2.c(9,5) ]
   remark #15389: vectorization support: reference b[i] has unaligned access   [ t2.c(9,13) ]
   remark #15389: vectorization support: reference c[i] has unaligned access   [ t2.c(9,20) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 1.375
   remark #15301: REMAINDER LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 11
   remark #15477: vector cost: 1.120
   remark #15478: estimated potential speedup: 6.990
   remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at t2.c(8,3)
<Remainder loop for vectorization>
LOOP END
===========================================================================
Begin optimization report for: foo2(float *, float *, float *, int)
    Report from: Vector optimizations [vec]
LOOP BEGIN at t2.c(17,3)
   remark #15389: vectorization support: reference a[i] has unaligned access   [ t2.c(18,5) ]
   remark #15389: vectorization support: reference a[i] has unaligned access   [ t2.c(18,5) ]
   remark #15389: vectorization support: reference b[i] has unaligned access   [ t2.c(18,13) ]
   remark #15389: vectorization support: reference c[i] has unaligned access   [ t2.c(18,20) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 8
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 0.500
   remark #15300: LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 3
   remark #15451: unmasked unaligned unit stride stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 11
   remark #15477: vector cost: 1.500
   remark #15478: estimated potential speedup: 2.750
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at t2.c(17,3)
<Remainder loop for vectorization>
   remark #15389: vectorization support: reference a[i] has unaligned access   [ t2.c(18,5) ]
   remark #15389: vectorization support: reference a[i] has unaligned access   [ t2.c(18,5) ]
   remark #15389: vectorization support: reference b[i] has unaligned access   [ t2.c(18,13) ]
   remark #15389: vectorization support: reference c[i] has unaligned access   [ t2.c(18,20) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 4
   remark #15427: loop was completely unrolled
   remark #15309: vectorization support: normalized vectorization overhead 0.750
   remark #15301: REMAINDER LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 3
   remark #15451: unmasked unaligned unit stride stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 11
   remark #15477: vector cost: 1.500
   remark #15478: estimated potential speedup: 2.750
   remark #15488: --- end vector cost summary ---
LOOP END
===========================================================================

Increase the size of your arrays and use option -qopt-assume-safe-padding to improve performance:

This option determines whether the compiler assumes that variables and dynamically allocated memory are padded past the end of the object.

When -qopt-assume-safe-padding is specified, the compiler assumes that variables and dynamically allocated memory are padded. This means that code can access up to 64 bytes beyond what is specified in your program. The compiler does not add any padding for static and automatic objects when this option is used, but it assumes that code can access up to 64 bytes beyond the end of the object, wherever the object appears in the program. To satisfy this assumption, you must increase the size of static and automatic objects in your program when you use this option.

1. One example of where this option can help is in the sequences generated by the compiler for vector-remainder and vector-peel loops. This option may improve performance of memory operations in such loops.

If this option is used in the compilation above, the compiler assumes that the arrays a, b, and c each have at least 64 bytes of padding beyond their last (n-th) element.

If these arrays were allocated using malloc such as:

ptr = (float *)malloc(sizeof(float) * n);

then the allocations should be changed to:

ptr = (float *)malloc(sizeof(float) * n + 64);
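
Putting this together, a minimal sketch of a caller that satisfies the padding requirement for all three arrays (the helper name is illustrative):

#include <stdlib.h>

/* Allocate n floats plus 64 bytes of padding so that the compiler's
   safe-padding assumption holds for accesses past the end of the array. */
static float *alloc_padded(size_t n)
{
  return (float *)malloc(sizeof(float) * n + 64);
}

/* Usage:
   float *a = alloc_padded(n), *b = alloc_padded(n), *c = alloc_padded(n);
   foo1(a, b, c, n);
*/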

After making such changes to satisfy the legality requirements of the option, adding it to the compilation above yields the following higher-performing sequence for the peel loop generated for the loop in foo1:

..B2.7:                         # Preds ..B2.9 ..B2.6 Latency 13
        vpcmpgtd  %zmm0, %zmm2, %k0                             #7.3 c1
        nop                                                     #7.3 c5
        knot      %k0, %k1                                      #7.3 c9
        jkzd      ..B2.9, %k1   # Prob 20%                      #7.3 c13
                                # LOE rdx rbx rbp rsi rdi r9 r10 r11 r12 r13 r14 r15 eax ecx r8d zmm0 zmm1 zmm2 zmm3 k1
..B2.8:                         # Preds ..B2.7 Latency 53
        vmovaps   %zmm1, %zmm4                                  #8.13 c1
        vmovaps   %zmm1, %zmm5                                  #8.20 c5
        vmovaps   %zmm1, %zmm6                                  #8.5 c9
        vloadunpacklps (%rsi,%r10,4), %zmm4{%k1}                #8.13 c13
        vloadunpacklps (%rdx,%r10,4), %zmm5{%k1}                #8.20 c17
        vloadunpacklps (%rdi,%r10,4), %zmm6{%k1}                #8.5 c21
        vloadunpackhps 64(%rsi,%r10,4), %zmm4{%k1}              #8.13 c25
        vloadunpackhps 64(%rdx,%r10,4), %zmm5{%k1}              #8.20 c29
        vloadunpackhps 64(%rdi,%r10,4), %zmm6{%k1}              #8.5 c33
        vaddps    %zmm5, %zmm4, %zmm7                           #8.20 c37
        vmulps    %zmm7, %zmm6, %zmm8                           #8.5 c41
        nop                                                     #8.5 c45
        vpackstorelps %zmm8, (%rdi,%r10,4){%k1}                 #8.5 c49
        vpackstorehps %zmm8, 64(%rdi,%r10,4){%k1}               #8.5 c53
        movb      %al, %al                                      #8.5 c53
                                # LOE rdx rbx rbp rsi rdi r9 r10 r11 r12 r13 r14 r15 eax ecx r8d zmm0 zmm1 zmm2 zmm3
..B2.9:                         # Preds ..B2.7 ..B2.8 Latency 9
        addq      $16, %r10                                     #7.3 c1
        vpaddd    %zmm3, %zmm2, %zmm2                           #7.3 c5
        cmpq      %r11, %r10                                    #7.3 c5
        jb        ..B2.7        # Prob 82%                      #7.3 c9

Without this option, the compiler generates the following lower-performing sequence, using gathers and scatters, for the same peel loop:

..B2.7:                         # Preds ..B2.9 ..B2.6 Latency 13
        vpcmpgtd  %zmm0, %zmm2, %k0                             #7.3 c1
        nop                                                     #7.3 c5
        knot      %k0, %k4                                      #7.3 c9
        jkzd      ..B2.9, %k4   # Prob 20%                      #7.3 c13
                                # LOE rax rdx rbx rbp rsi rdi r9 r11 r13 r15 ecx r8d r10d zmm0 zmm1 zmm2 zmm3 k4
..B2.8:                         # Preds ..B2.7 Latency 57
        vmovaps   .L_2il0floatpacket.10(%rip), %zmm8            #8.5 c1
        vmovaps   %zmm1, %zmm4                                  #8.13 c5
        lea       (%rsi,%r13), %r14                             #8.13 c5
        vmovaps   %zmm1, %zmm5                                  #8.20 c9
        kmov      %k4, %k2                                      #8.13 c9
..L15:                                                          #8.13
        vgatherdps (%r14,%zmm8,4), %zmm4{%k2}                   #8.13
        jkzd      ..L14, %k2    # Prob 50%                      #8.13
        vgatherdps (%r14,%zmm8,4), %zmm4{%k2}                   #8.13
        jknzd     ..L15, %k2    # Prob 50%                      #8.13
..L14:                                                          #
        vmovaps   %zmm1, %zmm6                                  #8.5 c21
        kmov      %k4, %k3                                      #8.20 c21
        lea       (%rdx,%r13), %r14                             #8.20 c25
        lea       (%rdi,%r13), %r12                             #8.5 c25
..L17:                                                          #8.20
        vgatherdps (%r14,%zmm8,4), %zmm5{%k3}                   #8.20
        jkzd      ..L16, %k3    # Prob 50%                      #8.20
        vgatherdps (%r14,%zmm8,4), %zmm5{%k3}                   #8.20
        jknzd     ..L17, %k3    # Prob 50%                      #8.20
..L16:                                                          #
        vaddps    %zmm5, %zmm4, %zmm7                           #8.20 c37
        kmov      %k4, %k1                                      #8.5 c37
..L19:                                                          #8.5
        vgatherdps (%r12,%zmm8,4), %zmm6{%k1}                   #8.5
        jkzd      ..L18, %k1    # Prob 50%                      #8.5
        vgatherdps (%r12,%zmm8,4), %zmm6{%k1}                   #8.5
        jknzd     ..L19, %k1    # Prob 50%                      #8.5
..L18:                                                          #
        vmulps    %zmm7, %zmm6, %zmm9                           #8.5 c49
        nop                                                     #8.5 c53
..L21:                                                          #8.5
        vscatterdps %zmm9, (%r12,%zmm8,4){%k4}                  #8.5
        jkzd      ..L20, %k4    # Prob 50%                      #8.5
        vscatterdps %zmm9, (%r12,%zmm8,4){%k4}                  #8.5
        jknzd     ..L21, %k4    # Prob 50%                      #8.5
..L20:                                                          #
                                # LOE rax rdx rbx rbp rsi rdi r9 r11 r13 r15 ecx r8d r10d zmm0 zmm1 zmm2 zmm3
..B2.9:                         # Preds ..B2.7 ..B2.8 Latency 9
        addq      $16, %rax                                     #7.3 c1
        addq      $64, %r13                                     #7.3 c1
        vpaddd    %zmm3, %zmm2, %zmm2                           #7.3 c5
        cmpq      %r9, %rax                                     #7.3 c5
        jb        ..B2.7        # Prob 82%                      #7.3 c9

 

2. Another example where the option is useful is in the handling of short integer type conversions. In this case, the code the compiler generates under default options can be improved by adding -qopt-assume-safe-padding.

/* N is a compile-time constant defined elsewhere in the program. */
void foo(short * restrict a, short * restrict b, short * restrict c)
{
   int i;

   for (i = 0; i < N; i++) {
       a[i] = b[i] + c[i];
   }
}

In the main kernel loop, the compiler adds checks for each load/store to protect against memory faults, and in the remainder loop it emits gathers/scatters:

Main kernel loop under default options:

..B1.6:
        lea       (%rax,%rsi), %r10
        vloadunpackld (%rax,%rsi){sint16}, %zmm1
        andq      $63, %r10
        cmpq      $32, %r10
        jle       ..L3
        vloadunpackhd 64(%rax,%rsi){sint16}, %zmm1
..L3:
        vprefetch1 256(%rax,%rsi)
        lea       (%rax,%rdx), %r10
        vloadunpackld (%rax,%rdx){sint16}, %zmm2
        andq      $63, %r10
        cmpq      $32, %r10
        jle       ..L4
        vloadunpackhd 64(%rax,%rdx){sint16}, %zmm2
..L4:
        vprefetch0 128(%rax,%rsi)
        vpaddd    %zmm2, %zmm1, %zmm3
        vprefetch1 256(%rax,%rdx)
        vpandd    %zmm0, %zmm3, %zmm4
        vprefetch0 128(%rax,%rdx)
        addq      $16, %rcx
        vprefetch1 256(%rax,%rdi)
        lea       (%rax,%rdi), %r10
        andq      $63, %r10
        cmpq      $32, %r10
        jle       ..L5
        vpackstorehd %zmm4{uint16}, 64(%rax,%rdi)
..L5:
        vpackstoreld %zmm4{uint16}, (%rax,%rdi)
        vprefetch0 128(%rax,%rdi)
        addq      $32, %rax
        cmpq      $992, %rcx
        jb        ..B1.6

Remainder loop generated by the compiler under default options:

..L9:
        vpgatherdd 1984(%rdx,%zmm3,2){sint16}, %zmm1{%k2}
        jkzd      ..L8, %k2
        vpgatherdd 1984(%rdx,%zmm3,2){sint16}, %zmm1{%k2}
        jknzd     ..L9, %k2
..L8:
        vpaddd    %zmm1, %zmm0, %zmm2
        vpandd    .L_2il0floatpacket.3(%rip), %zmm2, %zmm4
        nop
..L11:
        vpscatterdd %zmm4{uint16}, 1984(%rdi,%zmm3,2){%k3}
        jkzd      ..L10, %k3    
        vpscatterdd %zmm4{uint16}, 1984(%rdi,%zmm3,2){%k3}
        jknzd     ..L11, %k3    
..L10:
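
As in the first example, legally compiling this code with -qopt-assume-safe-padding requires that a, b, and c each be allocated with at least 64 bytes of padding past the end; a minimal sketch (hypothetical helper name):

#include <stdlib.h>

/* Allocate n shorts plus 64 bytes of padding so that any accesses the
   compiler issues past the last element remain within the allocation. */
static short *alloc_padded_short(size_t n)
{
  return (short *)malloc(sizeof(short) * n + 64);
}

/* Usage:
   short *a = alloc_padded_short(N), *b = alloc_padded_short(N), *c = alloc_padded_short(N);
   foo(a, b, c);
*/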

When the option -qopt-assume-safe-padding is added, the compiler generates the following higher-performing versions of the main kernel loop and remainder loop:

Main kernel loop with -qopt-assume-safe-padding option:

..B1.6:
        vloadunpackld (%rax,%rsi){sint16}, %zmm1
        vprefetch1 256(%rax,%rsi)
        vloadunpackld (%rax,%rdx){sint16}, %zmm2
        vprefetch0 128(%rax,%rsi)
        vloadunpackhd 64(%rax,%rsi){sint16}, %zmm1
        vprefetch1 256(%rax,%rdx)
        vloadunpackhd 64(%rax,%rdx){sint16}, %zmm2
        vprefetch0 128(%rax,%rdx)
        vpaddd    %zmm2, %zmm1, %zmm3
        vprefetch1 256(%rax,%rdi)
        vpandd    %zmm0, %zmm3, %zmm4
        vprefetch0 128(%rax,%rdi)
        addq      $16, %rcx
        movb      %dl, %dl
        vpackstoreld %zmm4{uint16}, (%rax,%rdi)
        vpackstorehd %zmm4{uint16}, 64(%rax,%rdi)
        addq      $32, %rax
        cmpq      $992, %rcx
        jb        ..B1.6

Remainder loop with the -qopt-assume-safe-padding option added (higher performing, with no gathers/scatters):

        vloadunpackld 1984(%rsi){sint16}, %zmm0{%k1}
        vloadunpackld 1984(%rdx){sint16}, %zmm1{%k1}
        vloadunpackhd 2048(%rsi){sint16}, %zmm0{%k1}
        vloadunpackhd 2048(%rdx){sint16}, %zmm1{%k1}
        vpaddd    %zmm1, %zmm0, %zmm2
        vpandd    .L_2il0floatpacket.3(%rip), %zmm2, %zmm3
        nop
        vpackstoreld %zmm3{uint16}, 1984(%rdi){%k1}
        vpackstorehd %zmm3{uint16}, 2048(%rdi){%k1}
        movb      %al, %al

 

NEXT STEPS

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon architecture. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

Back to the main chapter: Vectorization Essentials.