ID 683176
Date 9/24/2018
Public

## 6.2. Kernel Vectorization

To achieve higher throughput, you can vectorize your kernel. Kernel vectorization allows multiple work-items to execute in a single instruction multiple data (SIMD) fashion. You can direct the to translate each scalar operation in the kernel, such as addition or multiplication, to an SIMD operation.

Include the num_simd_work_items attribute in your kernel code to direct the offline compiler to perform more additions per work-item without modifying the body of the kernel. The following code fragment applies a vectorization factor of four to the original kernel code:

__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
__global const float * restrict b,
{
size_t gid = get_global_id(0);

}

To use the num_simd_work_items attribute, you must also specify a required work-group size of the kernel using the reqd_work_group_size attribute. The work-group size you specify for reqd_work_group_size must be divisible by the value you assign to num_simd_work_items. In the code example above, the kernel has a fixed work-group size of 64 work-items. Within each work-group, the work-items are distributed evenly among the four SIMD vector lanes. After the offline compiler implements the four SIMD vector lanes, each work-item now performs four times more work.

The offline compiler vectorizes the code and might coalesce memory accesses. You do not need to change any kernel code or host code because the offline compiler applies these optimizations automatically.

You can vectorize your kernel code manually, but you must adjust the NDRange in your host application to reflect the amount of vectorization you implement. The following example shows the changes in the code when you duplicate operations in the kernel manually:

__kernel void sum (__global const float * restrict a,
__global const float * restrict b,
{
size_t gid = get_global_id(0);

answer[gid * 4 + 0] = a[gid * 4 + 0] + b[gid * 4 + 0];
answer[gid * 4 + 1] = a[gid * 4 + 1] + b[gid * 4 + 1];
answer[gid * 4 + 2] = a[gid * 4 + 2] + b[gid * 4 + 2];
answer[gid * 4 + 3] = a[gid * 4 + 3] + b[gid * 4 + 3];
}

In this form, the kernel loads four elements from arrays a and b, calculates the sums, and stores the results into the array answer. Because the FPGA pipeline loads and stores data to neighboring locations in memory, you can manually direct the offline compiler to coalesce each group of four load and store operations.

Attention: Each work-item handles four times as much work after you implement the manual optimizations. As a result, the host application must use an NDRange that is four times smaller than in the original example. On the contrary, you do not need to adjust the NDRange size when you exploit the automatic vectorization capabilities of the offline compiler. You can adjust the vector width with minimal code changes by using the num_simd_work_items attribute.