7.4. Combination of Compute Unit Replication and Kernel SIMD Vectorization
Consider a case where a kernel with the num_simd_work_items attribute set to 16 does not fit in the FPGA. The kernel might fit if you instead replicate a compute unit with a narrower SIMD width. Finding the best balance between the number of compute units and the SIMD width might require some experimentation. For example, three compute units that are each four SIMD lanes wide might achieve better throughput than two compute units that are each eight SIMD lanes wide.
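As a starting point for that experimentation, the candidate combinations can be enumerated before compiling any of them. The following C sketch is only a planning aid run on the host, not part of the kernel. The helper names total_lanes and list_candidates and the idea of a "lane budget" are hypothetical; total lane count is a crude stand-in for area, and whether a configuration actually fits, and how fast it runs, must still be determined by compiling and measuring.

```c
#include <stdio.h>

/* Hypothetical helper: total SIMD lanes across all compute units. */
static int total_lanes(int num_compute_units, int simd_width) {
    return num_compute_units * simd_width;
}

/* Hypothetical helper: prints every (compute units, SIMD width) pair whose
   total lane count does not exceed lane_budget, and returns how many pairs
   were printed. A SIMD width of 1 means the kernel is not vectorized. */
static int list_candidates(int lane_budget, int max_compute_units) {
    static const int simd_widths[] = {1, 2, 4, 8, 16};
    const int num_widths = (int)(sizeof simd_widths / sizeof simd_widths[0]);
    int count = 0;

    for (int cu = 1; cu <= max_compute_units; ++cu) {
        for (int i = 0; i < num_widths; ++i) {
            int lanes = total_lanes(cu, simd_widths[i]);
            if (lanes <= lane_budget) {
                printf("compute units: %d, SIMD width: %2d, total lanes: %2d\n",
                       cu, simd_widths[i], lanes);
                ++count;
            }
        }
    }
    return count;
}
```

Calling list_candidates(16, 4) lists the pairs worth trying when at most four compute units and sixteen lanes in total are considered. Both configurations from the example above appear in that list: three compute units of width four (12 lanes) and two compute units of width eight (16 lanes).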
The following example code shows how you can combine the num_compute_units and num_simd_work_items attributes in your OpenCL™ code:
__attribute__((num_simd_work_items(4)))
__attribute__((num_compute_units(3)))
__attribute__((reqd_work_group_size(8,8,1)))
__kernel void matrixMult(__global float * restrict C,
                         __global float * restrict A,
. . .
The figure below illustrates the data flow of the kernel described above. The num_compute_units attribute creates three replicated compute units. The num_simd_work_items attribute gives each compute unit four SIMD vector lanes.
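The relationship between the three attribute values in the example can be checked with a little arithmetic. The C sketch below is a host-side sanity check under two stated assumptions: that the SIMD width must be 2, 4, 8, or 16, and that it must evenly divide the first dimension of reqd_work_group_size. Confirm both rules against the SDK documentation for your release. The function names are hypothetical.

```c
#include <stdbool.h>

/* Assumed rule: the SIMD width is 2, 4, 8, or 16 and evenly divides the
   first (X) dimension of the required work-group size. */
static bool simd_width_is_valid(int simd_width, int wg_x) {
    bool allowed = (simd_width == 2 || simd_width == 4 ||
                    simd_width == 8 || simd_width == 16);
    return allowed && wg_x > 0 && (wg_x % simd_width == 0);
}

/* Number of work-items in one work-group. */
static int work_items_per_group(int wg_x, int wg_y, int wg_z) {
    return wg_x * wg_y * wg_z;
}

/* Number of SIMD-wide batches needed to cover one work-group, assuming
   each batch carries simd_width work-items through the compute unit. */
static int simd_batches_per_group(int wg_x, int wg_y, int wg_z,
                                  int simd_width) {
    return work_items_per_group(wg_x, wg_y, wg_z) / simd_width;
}
```

For the matrixMult example, reqd_work_group_size(8,8,1) gives 64 work-items per work-group, and a SIMD width of 4 divides the first dimension (8) evenly, so under these assumptions each work-group passes through a compute unit as 16 batches of four work-items.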