To maximize CPU vector unit utilization, try to use vector data types in your kernel code. This technique enables you to map vector data types directly to the hardware vector registers. Thus, the data types used should match the width of the underlying SIMD instructions.
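As a rough illustration of what "matching the width" means, the number of vector lanes is the register width divided by the element size. The sketch below is plain C arithmetic, and the helper name `lanes_per_register` is hypothetical, not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many elements of elem_bytes bytes each
   fit into one SIMD register that is register_bits wide. */
size_t lanes_per_register(size_t register_bits, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return register_bits / (8 * elem_bytes);
}
```

For a 256-bit register this gives 8 lanes for 4-byte floats (corresponding to `float8`) and 4 lanes for 8-byte doubles (corresponding to `double4`).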
Consider the following recommendations:
On 2nd Generation Intel® Core™ processors and higher with Intel® AVX support, use data types such as float8 or double4, so that you bind the code to the specific register width of the underlying hardware. This method provides maximum performance on a specific platform. However, performance on other platforms and supported Intel processors might be less than optimal.
You may use wider data types, such as float16, to transparently cover many SIMD hardware register widths. However, using types wider than the underlying hardware is similar to loop unrolling: it might improve performance in some cases, but it also increases register pressure. Still, consider using the uchar16 data type to process four pixels simultaneously when operating on eight-bit-per-component pixels.
When manually “vectorizing” an original kernel that uses scalar data types (like float) to use vector data types (like float8) instead, remember that each work-item then processes N elements (N = 8 in the float/float8 example). Make sure you reduce the global size by a factor of N accordingly, which requires the original global size to be divisible by N.
The int8 data type improves performance on 4th Generation Intel® Core™ processors and higher.
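The manual vectorization advice above can be sketched from the host side as follows. This is a minimal sketch under stated assumptions: the kernel names `scale` and `scale8`, the string variables, and the helper `vectorized_global_size` are all hypothetical. The kernel text is held in C strings, the form in which host code usually passes it to `clCreateProgramWithSource`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar kernel: one float per work-item. */
const char *const scale_scalar_src =
    "__kernel void scale(__global const float *in,\n"
    "                    __global float *out, float k)\n"
    "{\n"
    "    size_t id = get_global_id(0);\n"
    "    out[id] = in[id] * k;\n"
    "}\n";

/* Hypothetical hand-vectorized kernel: eight floats per work-item. */
const char *const scale_float8_src =
    "__kernel void scale8(__global const float8 *in,\n"
    "                     __global float8 *out, float k)\n"
    "{\n"
    "    size_t id = get_global_id(0);\n"
    "    out[id] = in[id] * k;\n"
    "}\n";

/* Global size for the vectorized kernel: the element count divided by
   the vector width. Returns 0 when the count is not a multiple of the
   width, so the caller can pad the buffer or handle the tail itself. */
size_t vectorized_global_size(size_t n_elements, size_t width)
{
    assert(width > 0);
    if (n_elements % width != 0)
        return 0;
    return n_elements / width;
}
```

With 1024 floats and `float8`, the vectorized kernel would be enqueued with a global size of 128 instead of 1024. Returning 0 for a non-divisible size is only one possible design; padding the buffers is another.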
Using this coding technique, you plan the vector-level parallelism yourself instead of relying on the implicit vectorization module (see the "Benefitting from Implicit Vectorization" section). This approach is useful in the following scenarios:
You are porting code that originally used Intel® SSE, AVX, or AVX2 instructions.
You want to benefit from hand-tuned vectorization of your code.
The following example shows a multiplication kernel that targets the 256-bit vector units of the 2nd Generation Intel Core Processors:
__kernel __attribute__((vec_type_hint(float8)))
void edp_mul(__constant float8 *a,
             __constant float8 *b,
             __global float8 *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
In this example, the data passed to the kernel represents buffers of float8. The calculations are performed on eight elements together.
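To make the element mapping concrete, the following plain C sketch models what each work-item of such a kernel covers: work-item `id` handles elements `8*id` through `8*id + 7`. It is a scalar reference model only, not how the OpenCL runtime executes the kernel, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { VEC_WIDTH = 8 };  /* lanes in a float8 */

/* Scalar model of one work-item: multiplies the eight consecutive
   floats that a single float8 load, multiply, and store would cover. */
void edp_mul_work_item(const float *a, const float *b,
                       float *result, size_t id)
{
    assert(a && b && result);
    size_t base = id * VEC_WIDTH;
    for (size_t lane = 0; lane < VEC_WIDTH; ++lane)
        result[base + lane] = a[base + lane] * b[base + lane];
}

/* Scalar model of the whole launch: global_size work-items, each
   processing VEC_WIDTH elements. */
void edp_mul_reference(const float *a, const float *b,
                       float *result, size_t global_size)
{
    for (size_t id = 0; id < global_size; ++id)
        edp_mul_work_item(a, b, result, id);
}
```

A model like this can serve as a host-side reference when checking the output of the vectorized kernel.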
The __attribute__((vec_type_hint(float8))) attribute placed before the kernel signals to the compiler, or the implementation, that this kernel already has an optimized vectorized form, so the implicit vectorization module does not operate on it. Use vec_type_hint to indicate to the compiler that your kernel already processes data using mostly vector types. For more details on this attribute, see the OpenCL™ 1.2 Specification.