Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022

7.3.1. Compute Unit Replication versus Kernel SIMD Vectorization

In most cases, you should implement the num_simd_work_items attribute to increase data processing efficiency before using the num_compute_units attribute.

Both the num_compute_units and num_simd_work_items attributes increase throughput by increasing the amount of hardware that the Intel® FPGA SDK for OpenCL™ Offline Compiler uses to implement your kernel. The num_compute_units attribute changes the number of compute units to which work-groups can be scheduled, which also changes the number of times a kernel accesses global memory. In contrast, the num_simd_work_items attribute changes the amount of work a compute unit can perform in parallel on a single work-group. The num_simd_work_items attribute duplicates only the datapath of the compute unit; the control logic is shared across all SIMD vector lanes.

Generally, using the num_simd_work_items attribute leads to more efficient hardware than using the num_compute_units attribute to achieve the same goal. The num_simd_work_items attribute also allows the offline compiler to coalesce your memory accesses.

Figure 77. Compute Unit Replication versus Kernel SIMD Vectorization

When multiple compute units compete for global memory, undesired memory access patterns might result. You can alter such a pattern by using the num_simd_work_items attribute instead of the num_compute_units attribute. In addition, the num_simd_work_items attribute can potentially deliver the same computational throughput as the equivalent compute unit duplication that the num_compute_units attribute provides.
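The following is a minimal sketch of how the two attributes might appear on a kernel. The kernel name scale, the work-group size of 64, and the SIMD width of 4 are assumptions chosen for illustration. The sketch pairs num_simd_work_items with reqd_work_group_size because the vectorization width must divide the required work-group size.

```c
/* Illustrative OpenCL C kernel. The name "scale", the work-group size of 64,
 * and the SIMD width of 4 are assumptions chosen for this sketch. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(4)))   /* 4 SIMD lanes share one control path */
/* Alternative discussed in the text: replicate the whole compute unit instead,
 * for example __attribute__((num_compute_units(4))). */
__kernel void scale(__global const float * restrict in,
                    __global float * restrict out)
{
    size_t gid = get_global_id(0);   /* one element per work-item */
    out[gid] = in[gid] * 2.0f;       /* no work-item-dependent branching */
}
```

Because every work-item follows the same control path and accesses consecutive elements, a kernel of this shape is a plausible candidate for vectorization. Whether the offline compiler actually vectorizes it depends on the compilation messages it reports.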

You cannot implement the num_simd_work_items attribute in your kernel under the following circumstances:

  • The value you specify for num_simd_work_items is not 2, 4, 8 or 16.
  • The value of reqd_work_group_size is not divisible by num_simd_work_items.

    For example, specifying num_simd_work_items(4) together with a required work-group size of 50 is incorrect because 50 is not divisible by 4.

  • The kernel has complex control flow. You cannot vectorize lines of code within a kernel in which different work-items follow different control paths (for example, the control paths depend on get_global_id or get_local_id).
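The two numeric restrictions in the list above can be checked on the host before compilation. The sketch below is a hypothetical helper, not part of the SDK. The name simd_width_is_valid and its parameters are assumptions, and it does not attempt to detect the control-flow restriction.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the SDK): checks the two numeric rules
 * listed above for a requested SIMD width and a required work-group size. */
static bool simd_width_is_valid(unsigned simd_width, size_t work_group_size)
{
    /* Rule 1: the width must be 2, 4, 8, or 16. */
    bool allowed = simd_width == 2 || simd_width == 4 ||
                   simd_width == 8 || simd_width == 16;
    if (!allowed)
        return false;

    /* Rule 2: the required work-group size must be divisible by the width. */
    return work_group_size % simd_width == 0;
}
```

Under these assumptions, simd_width_is_valid(4, 50) returns false, which matches the example in the list: 50 is not divisible by 4.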

During kernel compilation, the offline compiler issues messages informing you whether the implementation of vectorization optimizations is successful. Kernel vectorization is successful if the reported vectorization factor matches the value you specify for the num_simd_work_items attribute.