7.2. Kernel Vectorization
Include the num_simd_work_items attribute in your kernel code to direct the offline compiler to perform more additions per work-item without modifying the body of the kernel. The following code fragment applies a vectorization factor of four to the original kernel code:
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
    size_t gid = get_global_id(0);
    answer[gid] = a[gid] + b[gid];
}
To use the num_simd_work_items attribute, you must also specify a required work-group size for the kernel using the reqd_work_group_size attribute. The work-group size you specify for reqd_work_group_size must be divisible by the value you assign to num_simd_work_items. In the code example above, the kernel has a fixed work-group size of 64 work-items, and 64 is divisible by the vectorization factor of four. Within each work-group, the work-items are distributed evenly among the four SIMD vector lanes. After the offline compiler implements the four SIMD vector lanes, each work-item performs four times as much work.
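The divisibility rule can be checked on the host before the kernel is built or enqueued. The following is a minimal sketch in plain C, not part of the SDK; the helper name simd_config_is_valid is hypothetical and only mirrors the rule stated above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an SDK function): returns true when the
 * required work-group size is a whole multiple of the SIMD factor,
 * which is the condition described for num_simd_work_items. */
static bool simd_config_is_valid(size_t work_group_size, size_t simd_factor)
{
    /* Zero is rejected up front to avoid a division by zero below. */
    if (work_group_size == 0 || simd_factor == 0) {
        return false;
    }
    return work_group_size % simd_factor == 0;
}
```

For the example kernel, simd_config_is_valid(64, 4) holds, whereas a work-group size of 64 combined with a factor of 3 is rejected.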
The offline compiler vectorizes the code and might coalesce memory accesses. You do not need to change any kernel code or host code because the offline compiler applies these optimizations automatically.
You can vectorize your kernel code manually, but you must adjust the NDRange in your host application to reflect the amount of vectorization you implement. The following example shows the changes in the code when you duplicate operations in the kernel manually:
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
    size_t gid = get_global_id(0);
    answer[gid * 4 + 0] = a[gid * 4 + 0] + b[gid * 4 + 0];
    answer[gid * 4 + 1] = a[gid * 4 + 1] + b[gid * 4 + 1];
    answer[gid * 4 + 2] = a[gid * 4 + 2] + b[gid * 4 + 2];
    answer[gid * 4 + 3] = a[gid * 4 + 3] + b[gid * 4 + 3];
}
In this form, the kernel loads four elements from arrays a and b, calculates the sums, and stores the results into the array answer. Because the FPGA pipeline loads and stores data to neighboring locations in memory, you can manually direct the offline compiler to coalesce each group of four load and store operations.
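The host-side NDRange adjustment for the manually vectorized kernel can be sketched as follows. This is an illustrative plain-C emulation, not SDK code: the names VEC_FACTOR, vectorized_global_size, sum_work_item, and run_sum are hypothetical, and the sketch assumes the element count is a multiple of four. Each work-item now covers four elements, so the global size shrinks by the same factor.

```c
#include <stddef.h>

#define VEC_FACTOR 4  /* matches the four duplicated statements in the kernel */

/* Hypothetical helper: computes the global work size for the manually
 * vectorized kernel. Returns 0 on success, or -1 if n_elements is not
 * a multiple of VEC_FACTOR (the caller would need to pad the buffers). */
static int vectorized_global_size(size_t n_elements, size_t *global_size)
{
    if (n_elements % VEC_FACTOR != 0) {
        return -1;
    }
    *global_size = n_elements / VEC_FACTOR;
    return 0;
}

/* CPU stand-in for one work-item of the manually vectorized kernel:
 * gid selects a group of VEC_FACTOR neighboring elements. */
static void sum_work_item(size_t gid, const float *a, const float *b,
                          float *answer)
{
    for (size_t i = 0; i < VEC_FACTOR; ++i) {
        answer[gid * VEC_FACTOR + i] =
            a[gid * VEC_FACTOR + i] + b[gid * VEC_FACTOR + i];
    }
}

/* Emulates the NDRange launch on the CPU: one call per work-item. */
static int run_sum(const float *a, const float *b, float *answer,
                   size_t n_elements)
{
    size_t global_size = 0;
    if (vectorized_global_size(n_elements, &global_size) != 0) {
        return -1;
    }
    for (size_t gid = 0; gid < global_size; ++gid) {
        sum_work_item(gid, a, b, answer);
    }
    return 0;
}
```

In an actual host program, the computed size would be passed as the global_work_size argument of clEnqueueNDRangeKernel in place of the original element count.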