Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference

ID 767253
Date 3/22/2024
Public

vector

Tells the compiler that a loop should be vectorized according to the argument keywords.

Syntax

#pragma vector {always[assert]|aligned|unaligned|dynamic_align|nodynamic_align|temporal|nontemporal|[no]vecremainder|vectorlength(n1[, n2]...)}

Arguments

always [assert]

Instructs the compiler to override its efficiency heuristics when deciding whether to vectorize, so that loops with non-unit strides or heavily unaligned memory accesses are also vectorized. The keyword applies to the loop that immediately follows the pragma. It optionally takes the keyword assert.

If you specify assert, the compiler will generate a diagnostic message if the loop cannot be vectorized.

aligned

Instructs the compiler to use aligned data movement instructions for all array references when vectorizing.

unaligned

Instructs the compiler to use unaligned data movement instructions for all array references when vectorizing.

dynamic_align

Instructs the compiler to perform dynamic alignment optimization for the loop.

nodynamic_align

Disables dynamic alignment optimization for the loop.

temporal

Instructs the compiler to use temporal (that is, non-streaming) stores on systems based on all supported architectures, unless otherwise specified.

nontemporal

Instructs the compiler to use non-temporal (that is, streaming) stores on systems based on all supported architectures, unless otherwise specified.

When this keyword is specified, you must also insert any fences as required to ensure correct memory ordering within a thread or across threads. One typical way to do this is to insert a _mm_sfence intrinsic call just after the loops (such as the initialization loop) where the compiler may insert streaming store instructions.

vecremainder

Instructs the compiler to vectorize the remainder loop when the original loop is vectorized.

novecremainder

Instructs the compiler not to vectorize the remainder loop when the original loop is vectorized.

vectorlength (n1[, n2]...)

Tells the vectorizer which vector length (vectorization factor) to use when generating the main vector loop.

Description

The vector pragma indicates that a loop should be vectorized according to the argument keywords specified.

The compiler does not apply the vector pragma to nested loops; each nested loop needs a preceding pragma statement. Place the pragma before the loop control statement.
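The per-loop rule above can be sketched as follows. The function name scale_rows and the row-major layout are illustrative assumptions, not part of the original reference. Compilers that do not recognize the pragma typically ignore it, and the loops then run as ordinary scalar code.

```c
/* Hypothetical helper: scales a rows x cols matrix stored in row-major order. */
void scale_rows(float *m, int rows, int cols, float s) {
  /* This pragma applies to the outer loop only. */
  #pragma vector always
  for (int r = 0; r < rows; r++) {
    /* The inner loop needs its own pragma, placed directly
       before its loop control statement. */
    #pragma vector always
    for (int c = 0; c < cols; c++)
      m[r * cols + c] *= s;
  }
}
```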

Using the always keyword

When the always argument keyword is used, the compiler ignores its efficiency heuristics for the subsequent loop. When assert is added, the compiler generates a diagnostic message if the loop cannot be vectorized for any reason.
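A minimal sketch of always assert on a strided copy. The function name gather_stride4, the stride of 4, and the use of restrict are illustrative assumptions rather than part of the original reference.

```c
#include <stddef.h>

/* Hypothetical helper: copies every 4th element of src into dst.
   The non-unit stride on src is the kind of access pattern that
   efficiency heuristics may reject when "always" is not given. */
void gather_stride4(float *restrict dst, const float *restrict src, size_t n) {
  /* "assert" requests a diagnostic if the loop cannot be vectorized. */
  #pragma vector always assert
  for (size_t i = 0; i < n; i++)
    dst[i] = src[4 * i];
}
```

The restrict qualifiers are a design choice for this sketch: they tell the compiler that dst and src do not overlap, which removes one common obstacle to vectorization.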

Using the aligned/unaligned keywords

When the aligned/unaligned argument keyword is used with this pragma, it indicates that the loop should be vectorized using aligned/unaligned data movement instructions for all array references. Specify only one argument keyword: aligned or unaligned.

CAUTION:

If you specify aligned as an argument, you must be sure that the loop is vectorizable using this pragma, that is, that every array reference in the loop is actually aligned. Otherwise, the compiler generates incorrect code.
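One way to meet the alignment requirement is to allocate the data with an explicit alignment. The sketch below assumes a 64-byte alignment (chosen here to cover vector widths up to 512 bits) and uses the C11 function aligned_alloc. The names scale_aligned and alloc_floats_64 are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical allocator wrapper: returns storage for n floats on a
   64-byte boundary. C11 aligned_alloc expects the size to be a
   multiple of the alignment, so the byte count is rounded up. */
float *alloc_floats_64(size_t n) {
  size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
  return aligned_alloc(64, bytes);
}

/* Hypothetical helper: scales n floats in place. The caller must
   guarantee that a is suitably aligned (for example, obtained from
   alloc_floats_64); that guarantee is what makes "aligned" safe. */
void scale_aligned(float *a, size_t n, float c) {
  #pragma vector aligned
  for (size_t i = 0; i < n; i++)
    a[i] *= c;
}
```

Release the buffer with free when it is no longer needed.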

Using the dynamic_align and nodynamic_align keywords

Dynamic alignment is an optimization the compiler can perform to improve alignment of memory references inside the loop. It involves peeling iterations from the vector loop into a scalar loop (which may, in turn, also be vectorized) before the vector loop so that the vector loop aligns with a particular memory reference.

Specifying dynamic_align enables the optimization, but the compiler still uses efficiency heuristics to decide whether to apply it to the loop. Specifying nodynamic_align disables the optimization. By default, the compiler does not perform this optimization.
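The peeling idea can be illustrated by hand. This is only a sketch of the concept, not the code the compiler emits. The names peel_count and add_one and the 32-byte boundary are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading float elements to peel so that
   &a[peel] falls on an align-byte boundary. Assumes align is a power of
   two and that a is at least float-aligned. Never returns more than n. */
size_t peel_count(const float *a, size_t n, size_t align) {
  size_t misalign = (size_t)((uintptr_t)a & (align - 1)); /* bytes past boundary */
  size_t peel = misalign ? (align - misalign) / sizeof(float) : 0;
  return peel < n ? peel : n;
}

/* Hand-written equivalent of a peeled loop: a short peel loop runs
   first so that the main loop starts at a 32-byte-aligned address. */
void add_one(float *a, size_t n) {
  size_t peel = peel_count(a, n, 32);
  size_t i = 0;
  for (; i < peel; i++)   /* peeled iterations */
    a[i] += 1.0f;
  for (; i < n; i++)      /* main loop, first access now aligned */
    a[i] += 1.0f;
}
```

With the pragma, the source keeps a single loop preceded by #pragma vector dynamic_align, and the compiler decides whether a transformation of this kind is worthwhile.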

Using the nontemporal and temporal keywords

The nontemporal and temporal argument keywords are used to control how the "stores" of register contents to storage are performed (streaming versus non-streaming) on systems based on Intel® 64 architectures.

By default, the compiler automatically determines whether a streaming store should be used for each variable.

Streaming stores can significantly outperform non-streaming stores on certain processors when a large amount of data is written and not read again soon. However, the misuse of streaming stores can significantly degrade performance.
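When nontemporal is specified, fences must be inserted where memory ordering requires them. A minimal sketch, assuming an x86 target where the _mm_sfence intrinsic is available through xmmintrin.h. The function name fill_streaming is hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_sfence; assumes an x86 target with SSE */

/* Hypothetical initializer: fills a buffer that is not expected to be
   read again soon, which is the typical case for streaming stores. */
void fill_streaming(float *a, size_t n, float value) {
  #pragma vector nontemporal
  for (size_t i = 0; i < n; i++)
    a[i] = value;
  /* Fence placed just after the loop, so that any streaming stores the
     compiler generated are ordered before stores that follow. */
  _mm_sfence();
}
```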

Using the [no]vecremainder keyword

When the vecremainder argument keyword is used with this pragma, the compiler vectorizes both the main and remainder loops.

When the novecremainder argument keyword is used with this pragma, the compiler vectorizes the main loop, but it does not vectorize the remainder loop.
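As a rough illustration, and ignoring any peeled iterations, a loop of n iterations vectorized with vector length vl leaves n % vl iterations for the remainder loop. The sketch below uses the hypothetical names remainder_iterations and saxpy.

```c
#include <stddef.h>

/* Hypothetical helper: iterations left for the remainder loop when the
   main vector loop processes vl elements per iteration. */
size_t remainder_iterations(size_t n, size_t vl) {
  return n % vl;
}

/* Requests that the remainder loop be vectorized as well; use
   novecremainder instead to keep the remainder loop scalar. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
  #pragma vector vecremainder
  for (size_t i = 0; i < n; i++)
    y[i] += a * x[i];
}
```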

Using the vectorlength keyword

Each n is an integer power of 2; the value must be 2, 4, 8, 16, 32, or 64. If more than one value is specified, the vectorizer chooses one of the specified vector lengths based on a cost model decision.
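A minimal sketch that offers the vectorizer two candidate vector lengths. The function name add_arrays and the choice of 4 and 8 are illustrative assumptions.

```c
/* Hypothetical helper: element-wise sum of two arrays. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, int n) {
  /* The cost model picks either 4 or 8 elements per vector iteration. */
  #pragma vector vectorlength(4, 8)
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```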

NOTE:

Pragma vector should be used with care.

Override the efficiency heuristics of the compiler only if you are sure that vectorization will improve performance. Furthermore, instructing the compiler to implement all array references with aligned data movement instructions causes a run-time exception if any of the accesses is actually unaligned.

Examples

Example using the vector aligned pragma

In the following example, the aligned argument keyword is used to request that the loop be vectorized with aligned instructions.

Note that the array is passed as a pointer of unknown alignment, so the compiler cannot prove on its own that aligned instructions are safe to use.

void vec_aligned(float *a, int m, int c) {
  int i;
  // Alignment unknown but compiler will still generate aligned load/store instructions
  #pragma vector aligned
  for (i = 0; i < m; i++)
    a[i] = a[i] * c; 
}

Example using the vector always pragma

void vec_always(int *a, int *b, int m) {
  #pragma vector always
  for(int i = 0; i <= m; i++)
    a[32*i] = b[99*i]; 
}

Example using the vector nontemporal pragma

float a[1000]; 
void foo(int N){
  int i;
  #pragma vector nontemporal
  for (i = 0; i < N; i++) {
    a[i] = 1;
  } 
}

The following listing shows the assembly generated for the loop. For large N, the streaming stores (movntps) can perform significantly better than non-streaming stores on processors that support Streaming SIMD Extensions (SSE).

.B1.2:
  movntps XMMWORD PTR _a[eax], xmm0
  movntps XMMWORD PTR _a[eax+16], xmm0
  add eax, 32
  cmp eax, ebx
  jl .B1.2