SIMD-Enabled Functions
SIMD-enabled functions (formerly called elemental functions) are a general language construct to express a data parallel algorithm. A SIMD-enabled function is written as a regular C/C++ function, and the algorithm within describes the operation on one element, using scalar syntax. The function can then be called as a regular C/C++ function to operate on a single element or it can be called in a data parallel context to operate on many elements.
How SIMD-Enabled Functions Work
When you write a SIMD-enabled function, the compiler generates short vector variants of the function that you requested, which can perform your function's operation on multiple arguments in a single invocation. The short vector variant may be able to perform multiple operations as fast as the regular implementation performs a single one by utilizing the vector instruction set architecture (ISA) in the CPU. When a call to a SIMD-enabled function occurs in a SIMD loop or another SIMD-enabled function, the compiler replaces the scalar call with the best fit from the available short-vector variants of the function.
In addition, when invoked from a pragma omp construct, the compiler may assign different copies of the SIMD-enabled functions to different threads (or workers), executing them concurrently. The end result is that your data parallel operation executes on the CPU utilizing both the parallelism available in the multiple cores and the parallelism available in the vector ISA. In other words, if the short vector function is called inside a parallel loop or an auto-parallelized loop that is vectorized, you can achieve both vector-level and thread-level parallelism.
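As a minimal sketch of the idea, the following uses the OpenMP* form of the declaration. The names scale_and_shift and apply_all are hypothetical and chosen only for illustration. If OpenMP SIMD support is not enabled, the pragmas are ignored and the code runs as ordinary scalar C++.

```cpp
#include <cassert>

// Scalar-syntax body: describes the operation on ONE element.
// With OpenMP SIMD support enabled, the pragma asks the compiler to emit
// short vector variants as well; otherwise the pragma is ignored.
#pragma omp declare simd
float scale_and_shift(float x, float scale, float shift) {
    return x * scale + shift;
}

// Hypothetical driver: calls the function in a data parallel context.
void apply_all(const float* in, float* out, int n) {
    assert(in != nullptr && out != nullptr && n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        // Inside the SIMD loop the compiler may substitute a vector variant.
        out[i] = scale_and_shift(in[i], 2.0f, 1.0f);
    }
}
```

The same function can still be called on a single element, for example scale_and_shift(1.5f, 2.0f, 1.0f).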
Declare a SIMD-Enabled Function
In order for the compiler to generate the short vector function, you need to use the appropriate syntax from below in your code:
Windows*:
Use the __declspec(vector (clauses)) declaration, as follows:
__declspec(vector (clauses)) return_type simd_enabled_function_name(parameters)
Linux*:
Use the __attribute__((vector (clauses))) declaration, as follows:
__attribute__((vector (clauses))) return_type simd_enabled_function_name(parameters)
Alternately, you can use the following OpenMP* pragma, which requires the [q or Q]openmp or [q or Q]openmp-simd compiler option:
#pragma omp declare simd clauses
The clauses in the vector declaration may be used to achieve better performance by overriding defaults. The clauses at a SIMD-enabled function definition declare one or several short vector variants of that function. Multiple vector declarations with different sets of clauses may be attached to one function in order to declare multiple short vector variants for it.
The clauses are defined as follows:
- processor(cpuid)
- Tells the compiler to generate a vector variant using the instructions, the caller/callee interface, and the default vectorlength selection scheme suitable to the specified processor. Use of this clause is highly recommended, especially for processors with wider vector register support (i.e., core_2nd_gen_avx and newer). cpuid takes one of the following values:
- core_4th_gen_avx_tsx
- core_4th_gen_avx
- core_3rd_gen_avx
- core_2nd_gen_avx
- core_aes_pclmulqdq
- core_i7_sse4_2
- atom
- core_2_duo_sse4_1
- core_2_duo_ssse3
- pentium_4_sse3
- pentium_m
- pentium_4
- haswell
- broadwell
- skylake
- skylake_avx512
- knl
- knm
- vectorlength(n) / simdlen(n) (for omp declare simd)
- Where n is a vector length that is a power of 2, no greater than 32. The simdlen clause tells the compiler that each routine invocation at the call site should execute the computation equivalent to n times the scalar function execution. When omitted, the compiler selects the vector length automatically depending on the routine return value, parameters, and/or the processor clause. When multiple vector variants are called from one vectorization context (for example, two different functions called from the same vector loop), explicit use of identical simdlen values is advised to achieve good performance.
- linear(list_item[, list_item...]) where list_item is one of: param[:step], val(param[:step]), ref(param[:step]), or uval(param[:step])
- The linear clause tells the compiler that for each consecutive invocation of the routine in a serial execution, the value of param is incremented by step, where param is a formal parameter of the specified function or the C++ keyword this. The linear clause can be used on parameters that are either scalar (non-arrays and of non-structured types), pointers, or C++ references. step is a compile-time integer constant expression, which defaults to 1 if omitted. If more than one step is specified for a particular parameter, a compile-time error occurs. Multiple linear clauses will be merged as a union. The meaning of each variant of the clause is as follows:
- linear(param[:step])
- For parameters that are not C++ references: the clause tells the compiler that on each iteration of the loop from which the routine is called the value of the parameter will be incremented by step. The clause can also be used for C++ references for backward compatibility, but it is not recommended.
- linear(val(param[:step]))
- For parameters that are C++ references: the clause tells the compiler that on each iteration of the loop from which the routine is called the referenced value of the parameter will be incremented by step.
- linear(uval(param[:step]))
- For C++ references: means the same as linear(val()). The difference is that with linear(val()) a vector of references is passed to the vector variant of the routine, whereas with linear(uval()) only one reference is passed, so linear(uval()) is the better choice in terms of performance.
- linear(ref(param[:step]))
- For C++ references: means that the reference itself is linear, i.e. the referenced values (that form a vector for calculations) are located sequentially, as in an array, with the distance between elements equal to step.
- uniform(param [, param,]…)
- Where param is a formal parameter of the specified function or the C++ keyword this. The uniform clause tells the compiler that the values of the specified arguments can be broadcast to all iterations as a performance optimization. It is often useful in generating more favorable vector memory references. On the other hand, lack of a uniform clause may allow broadcast operations to be hoisted out of the caller loop. Evaluate the performance implications carefully. Multiple uniform clauses are merged as a union.
- mask / nomask
- The mask and nomask clauses tell the compiler to generate only masked or unmasked (respectively) vector variants of the routine. When omitted, both masked and unmasked variants are generated. The masked variant is used when the routine is called conditionally.
- inbranch / notinbranch
- The inbranch and notinbranch clauses are used with #pragma omp declare simd. The inbranch clause works the same as the mask clause above and the notinbranch clause works the same as the nomask clause above.
Write the code inside your function using existing C/C++ syntax and relevant built-in functions (see the section on __intel_simd_lane() below).
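To make the clauses concrete, here is a minimal sketch using the OpenMP* form with simdlen, uniform, linear, inbranch and notinbranch. The function names, the vector length of 8, and the choice of clauses are assumptions made for illustration only; the best combination depends on the target processor and the call sites.

```cpp
#include <cassert>

// Two variants requested for one function (illustrative choices):
//  - base and scale are the same for every lane (uniform),
//  - i advances by 1 per consecutive invocation (linear),
//  - one unmasked variant for unconditional calls (notinbranch),
//  - one masked variant for calls made under a condition (inbranch).
#pragma omp declare simd simdlen(8) uniform(base, scale) linear(i:1) notinbranch
#pragma omp declare simd simdlen(8) uniform(base, scale) linear(i:1) inbranch
float scaled_load(const float* base, int i, float scale) {
    return base[i] * scale;
}

// Hypothetical caller: the call sits inside an if, so a masked variant
// is the one that can be matched when the loop is vectorized.
void scale_positive(const float* src, float* dst, int n, float scale) {
    assert(src != nullptr && dst != nullptr && n >= 0);
    #pragma omp simd simdlen(8)
    for (int i = 0; i < n; ++i) {
        if (src[i] > 0.0f) {
            dst[i] = scaled_load(src, i, scale);  // conditional call
        } else {
            dst[i] = 0.0f;
        }
    }
}
```

Because base is uniform and i is linear with step 1, the loads from base[i] can be unit-stride vector loads rather than gathers.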
Usage of Vector Function Specifications
You may define several vector variants for one routine, with each variant reflecting a possible usage of the routine. On encountering a call, the compiler matches vector variants against the kinds of the actual parameters and chooses the best match. Matching is done by priority. For example, if an actual parameter is loop invariant and the uniform clause was specified for the corresponding formal parameter, then the variant with the uniform clause has a higher priority. Linear specifications have the following order, from high priority to low: linear(uval()), linear(), linear(val()), linear(ref()). Consider the following example loops with calls to the same routine.
Example: OpenMP* (code listing not preserved)
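A minimal sketch of such loops, assuming a hypothetical routine func with pointer and integer parameters only. The clause sets and the comments on which variant fits each loop illustrate the matching rules described above; the variant a particular compiler actually selects may differ.

```cpp
#include <cassert>

// Three variants for one routine (clause choices are illustrative):
#pragma omp declare simd                                       // generic: matches any call
#pragma omp declare simd uniform(p, mul) linear(i:1)           // p, mul invariant; i linear
#pragma omp declare simd uniform(p) linear(i:1) linear(mul:1)  // both i and mul linear
int func(const int* p, int i, int mul) {
    return p[i] * mul;
}

void run_loops(const int* a, const int* ndx, int n, int mul,
               int* c1, int* c2, int* c3) {
    assert(a && ndx && c1 && c2 && c3 && n >= 0);

    #pragma omp simd
    for (int i = 0; i < n; ++i)
        c1[i] = func(a, i, mul);       // mul is loop invariant: uniform(mul) variant fits best

    #pragma omp simd
    for (int i = 0; i < n; ++i)
        c2[i] = func(a, i, i);         // third argument grows by 1: linear(mul) variant fits

    #pragma omp simd
    for (int i = 0; i < n; ++i)
        c3[i] = func(a, ndx[i], mul);  // index is neither uniform nor linear: generic variant
}
```

All three loops compute the same results regardless of which variant is chosen; the variants differ only in how efficiently arguments are passed and memory is accessed.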
SIMD-Enabled Functions and C++
You should use SIMD-enabled functions in modern C++ with caution: C++ imposes strict requirements on compilation and execution environments, which may not compose well with semantically rich language extensions such as SIMD-enabled functions. Three key aspects of C++ interact with the SIMD-enabled function concept: exception handling, dynamic polymorphism, and the C++ type system.
SIMD-Enabled Functions and Exception Handling
Exceptions are currently not supported in SIMD contexts: exceptions cannot be thrown and/or caught in SIMD loops and SIMD-enabled functions. Therefore, all SIMD-enabled functions are considered noexcept in C++11 terms. This affects not only the short vector variants of a function, but its original scalar routine as well. This is enforced when the function is compiled: it is checked for throw constructs and for calls to functions that throw exceptions. It is also enforced when a call to the SIMD-enabled function is compiled.
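One way to live with this restriction is to report failures through the result instead of throwing. The sketch below is only one possible pattern: safe_sqrt, count_invalid, and the sentinel value -1.0f are hypothetical names and conventions, not something the compiler requires.

```cpp
#include <cassert>
#include <cmath>

// Assumed pattern: signal an invalid input with a sentinel result
// (-1.0f here) rather than with an exception.
#pragma omp declare simd
float safe_sqrt(float x) noexcept {
    return x < 0.0f ? -1.0f : std::sqrt(x);
}

// Hypothetical caller: counts invalid inputs inside the SIMD loop so that
// any error handling can happen after the loop, outside the SIMD context.
int count_invalid(const float* in, float* out, int n) noexcept {
    assert(in != nullptr && out != nullptr && n >= 0);
    int bad = 0;
    #pragma omp simd reduction(+:bad)
    for (int i = 0; i < n; ++i) {
        out[i] = safe_sqrt(in[i]);
        bad += (in[i] < 0.0f) ? 1 : 0;
    }
    return bad;
}
```

A caller that wants exception semantics can inspect the returned count and throw after the loop has finished.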
SIMD-Enabled Functions and Dynamic Polymorphism
Vector attributes can be applied to virtual functions of classes with some limitations and taken into account during polymorphic virtual function calls. The syntax of vector declarations is the same as for regular SIMD-enabled class methods: just attach vector declarations as described above to the method declarations inside the class declaration.
Vector function attributes for virtual methods are inherited. If a vector attribute is specified for an overriding virtual function, it must match that of the overridden function. Even if the virtual method implementation is overridden in a derived class the vector declarations are inherited and applied. A set of vector variants is produced for the override according to vector variants set on parent. This rule also applies when the parent does not have any vector variants. If some virtual method is introduced as non-SIMD-enabled (no vector declarations supplied) it cannot become SIMD-enabled in the derived class even if the derived class contains its own implementation of the virtual method.
Matching vector variants for a virtual method is done by the declared (static) type of the object for which the method is called. The actual (dynamic) type of the object may either coincide with the static type or be derived from it.
Unlike regular function calls, which transfer control to one target function, the call target of a virtual function depends on the dynamic type of the object for which the method is called, and the call is made indirectly via the virtual function table of a class. In a single SIMD chunk, the virtual method may be invoked for objects of multiple classes, for example, elements of a polymorphic collection. This requires multiple calls to different targets within a single SIMD chunk. This works as follows:
- If a SIMD-enabled virtual function call is matched to a variant with a uniform this parameter, multiple calls are not needed. The compiler makes an indirect call to the matched vector variant of a virtual method of the object's dynamic class.
- If a SIMD-enabled virtual function call is matched to a variant with a non-uniform this parameter, all objects in a SIMD chunk may still share the same virtual method implementation. This is checked and a single, indirect call to the matched vector variant of the target virtual method implementation is invoked.
- Otherwise, lanes sharing virtual call targets are masked-in and a masked vector variant corresponding to the match is invoked in a loop for each unique virtual call target. If a masked variant is not provided for the matching vector variant and the this parameter is not declared uniform, the match will be rejected.
The following example illustrates SIMD-enabled virtual functions:
Example: OpenMP* (code listing not preserved)
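A minimal sketch of a vector declaration attached to a virtual method, assuming hypothetical classes Shape and Square. Whether the virtual call is actually vectorized depends on the compiler; the behavior described above is specific to this implementation, and other compilers may simply execute the call in scalar form.

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() = default;
    // Vector declaration attached to the virtual method in the base class.
    // Per the rules above, overrides in derived classes inherit it.
    #pragma omp declare simd
    virtual float scaled_area(float k) const { return k; }  // unit area
};

struct Square : Shape {
    float side;
    explicit Square(float s) : side(s) {}
    // No vector declaration repeated here: it is inherited from Shape.
    float scaled_area(float k) const override { return side * side * k; }
};

// Hypothetical caller: the same object is used on every iteration, so the
// this argument is loop invariant and a single call target is involved.
float total_area(const Shape& s, const float* k, int n) {
    assert(k != nullptr && n >= 0);
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += s.scaled_area(k[i]);
    }
    return sum;
}
```

Variant matching uses the static type Shape, while the implementation that runs is chosen by the dynamic type of the object passed in.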
The following are limitations to SIMD-enabled virtual function support:
- Multiple inheritance, including virtual inheritance, is not supported for classes having SIMD-enabled virtual methods. This is because calls to virtual functions in multiple inheritance cases may be done through special functions called thunks which adjust the 'this' pointer and/or virtual function table pointer. The current implementation doesn't support thunks for SIMD-enabled virtual calls because in this case thunks should themselves become SIMD-enabled functions which is not implemented.
- It is not possible to get the address of a SIMD-enabled virtual method. Support of SIMD-enabled virtual functions would require additional information, so their binary representation is different. Such cases will not be handled properly by code expecting a regular pointer to the virtual member.
SIMD-Enabled Functions and the C++ Type System
Vector attributes are attributes in the C++11 sense and so are not part of the function type of a SIMD-enabled function. Vector attributes are bound to the function itself, an instance of a function type. This has the following implications:
- Template instantiations having SIMD-enabled functions as template parameters won't catch vector attributes, so it is currently impossible to reliably preserve vector attributes in function wrapper templates like std::bind, which add indirection. This indirection may sometimes be optimized away by the compiler, and the resulting direct call will have all vector attributes associated.
- There is no way to overload or specialize templates by vector attributes.
- There is no way to write function traits to capture vector attributes for the sake of template metaprogramming.
The example below shows several situations in which this can be observed:
Example: OpenMP* (code listing not preserved)
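A minimal sketch of three such callers. Only the names caller1, caller2, caller3, function and arr come from the text; the signatures, the fixed array length kN, and the function body are assumptions made for illustration.

```cpp
#include <cassert>

constexpr int kN = 8;  // assumed fixed array length for this sketch

#pragma omp declare simd
int function(int x) { return x + 1; }

// Function supplied as a non-type template argument: the call target is
// known at instantiation time.
template <int (*F)(int)>
void caller1(int* arr) {
    assert(arr != nullptr);
    #pragma omp simd
    for (int i = 0; i < kN; ++i) arr[i] = F(arr[i]);
}

// Function supplied as a deduced callable: F deduces to int (*)(int),
// a type that carries no vector attribute.
template <typename F>
void caller2(F f, int* arr) {
    assert(arr != nullptr);
    #pragma omp simd
    for (int i = 0; i < kN; ++i) arr[i] = f(arr[i]);
}

// Function supplied as a plain function pointer: an indirect call unless
// the compiler inlines caller3 and resolves the target.
void caller3(int (*f)(int), int* arr) {
    assert(arr != nullptr);
    #pragma omp simd
    for (int i = 0; i < kN; ++i) arr[i] = f(arr[i]);
}
```

Typical calls would be caller1<function>(arr), caller2(function, arr) and caller3(function, arr).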
If calls to caller1, caller2 and caller3 are inlined, the compiler is able to replace indirect calls by direct calls in all cases. In this case caller2(function, arr) and caller3(function, arr) both call short vector variants of function as a result of the usual replacement of direct calls to function() by matching short vector variants in the SIMD loop.
Invoke a SIMD-Enabled Function with Parallel Context
Typically, the invocation of a SIMD-enabled function provides arrays wherever scalar arguments are specified as formal parameters.
The following two invocations will give instruction-level parallelism by having the compiler issue special vector instructions.
a[:] = ef_add(b[:],c[:]); //operates on the whole extent of the arrays a, b, c
a[0:n:s] = ef_add(b[0:n:s],c[0:n:s]); //use the full array notation construct to also specify n as an extent and s as a stride
The array notation syntax, as well as calling the SIMD-enabled function from a regular for loop, results in invoking the short vector function in each iteration and utilizing the vector parallelism, but the invocation is done in a serial loop, without utilizing multiple cores.
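To use multiple cores as well, the call can be placed in a loop that is both parallel and SIMD. The sketch below uses the standard OpenMP* combined construct parallel for simd; the body of ef_add and the wrapper add_arrays are assumptions for illustration. Thread-level parallelism needs full OpenMP support to be enabled; with SIMD-only support or none, the loop still runs correctly on one thread.

```cpp
#include <cassert>

// Assumed body for ef_add: an element-wise add of two floats.
#pragma omp declare simd notinbranch
float ef_add(float x, float y) {
    return x + y;
}

// Hypothetical wrapper: threads split the iteration space, and each
// thread's chunk of iterations can use the short vector variant.
void add_arrays(float* a, const float* b, const float* c, int n) {
    assert(a != nullptr && b != nullptr && c != nullptr && n >= 0);
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        a[i] = ef_add(b[i], c[i]);
    }
}
```

Each iteration writes a distinct element of a, so no synchronization between threads or lanes is needed.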
Use the __intel_simd_lane() Built-in Function
When called from within a vectorized loop, the __intel_simd_lane() built-in function will return a number between 0 and vectorlength - 1 that reflects the current "lane id" within the SIMD vector. __intel_simd_lane() will return zero if the loop is not vectorized. Calling __intel_simd_lane() outside of an explicit vector programming construct is discouraged. It may prevent auto-vectorization and such a call often results in the function returning zero instead of a value between 0 and vectorlength - 1.
To see how __intel_simd_lane() can be used, consider the following example:
// Scalar version: accumulates three running sums through pointers.
void accumulate(float *a, float *b, float *c, float d){
  *a+=sin(d);
  *b+=cos(d);
  *c+=log(d);
}
for (i=low; i<high; i++){
  accumulate(&suma, &sumb, &sumc, d[i]);
}
Example: OpenMP* (code listing not preserved)
The gather-scatter type memory addressing caused by the references to arrays A, B, and C in the SIMD-enabled function accumulate() will significantly hurt performance, making the whole conversion useless. To avoid this penalty you may use the __intel_simd_lane() built-in function as follows:
Example: OpenMP* (code listing not preserved)
With use of __intel_simd_lane() the references to the arrays in accumulate() will have unit-stride.
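A rough sketch of this per-lane accumulation pattern follows. Several things here are assumptions: the vector length VL of 8, the SIMD_LANE macro, the driver sum_all, and the use of the __INTEL_COMPILER macro to detect a compiler that provides the built-in. On other compilers the sketch falls back to lane 0 and a plain scalar loop, which mirrors the documented return value for non-vectorized loops.

```cpp
#include <cassert>
#include <cmath>

constexpr int VL = 8;  // assumed vector length for this sketch

// Assumption: only the classic Intel compiler is treated as providing
// the built-in; everything else uses lane 0.
#if defined(__INTEL_COMPILER)
#define SIMD_LANE() __intel_simd_lane()
#else
#define SIMD_LANE() 0
#endif

// a, b and c advance by one element per lane, so the accesses made by a
// vector variant are unit-stride instead of gather-scatter.
#pragma omp declare simd linear(a, b, c) simdlen(8)
void accumulate(float* a, float* b, float* c, float d) {
    *a += std::sin(d);
    *b += std::cos(d);
    *c += std::log(d);
}

// Hypothetical driver: keeps one partial sum per lane, then reduces.
void sum_all(const float* d, int low, int high,
             float* suma, float* sumb, float* sumc) {
    assert(d != nullptr && suma && sumb && sumc && low <= high);
    float A[VL] = {0.0f}, B[VL] = {0.0f}, C[VL] = {0.0f};
#if defined(__INTEL_COMPILER)
    #pragma omp simd simdlen(8)
#endif
    for (int i = low; i < high; ++i) {
        accumulate(&A[SIMD_LANE()], &B[SIMD_LANE()], &C[SIMD_LANE()], d[i]);
    }
    // Final reduction over the per-lane partial sums.
    for (int l = 0; l < VL; ++l) {
        *suma += A[l];
        *sumb += B[l];
        *sumc += C[l];
    }
}
```

The SIMD pragma is applied only when the built-in is available, because with a constant lane of 0 every iteration would update the same array element, which a SIMD loop must not do.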
Limitations
The following language constructs are not allowed within SIMD-enabled functions:
- The goto statement.
- The switch statement with 16 or more case statements.
- Operations on classes and structs (other than member selection).
- Any OpenMP* construct.