Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018
No Stalls, High Occupancy Percentage, and Low Bandwidth Efficiency

The structure of a kernel design might prevent it from leveraging all the available bandwidth that the accelerator board can offer.
Remember: An ideal kernel pipeline condition has a stall percentage of 0%, an occupancy percentage of 100%, and a bandwidth that equals the board's available bandwidth.
Figure 72. Example OpenCL Kernel and Profiler Analysis

In this example, the accelerator board can provide a bandwidth of 25600 megabytes per second (MB/s). However, the vector_add kernel requests only (2 reads + 1 write) × 4 bytes × 294 MHz = 12 bytes/cycle × 294 MHz = 3528 MB/s, which is about 14% of the available bandwidth. To use more of the available bandwidth, increase the amount of work, and therefore the number of bytes requested, in each clock cycle.
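The arithmetic above can be reproduced with a few lines of C. This is only a sketch for checking the numbers: the function names are hypothetical and are not part of the SDK or the Profiler, and the inputs (2 reads, 1 write, 4-byte elements, 294 MHz, 25600 MB/s) are the values quoted in this example.

```c
#include <assert.h>

/* Bytes requested per clock cycle, assuming every global memory access
 * moves one element of elem_bytes bytes (hypothetical helper). */
unsigned bytes_per_cycle(unsigned reads, unsigned writes, unsigned elem_bytes)
{
    return (reads + writes) * elem_bytes;
}

/* Requested bandwidth in MB/s.
 * bytes/cycle x MHz = bytes x 10^6 per second = MB/s. */
unsigned long requested_mbps(unsigned bytes_cycle, unsigned clock_mhz)
{
    return (unsigned long)bytes_cycle * clock_mhz;
}

/* Requested bandwidth as a percentage of what the board can provide. */
double bandwidth_efficiency_pct(unsigned long requested, unsigned long board)
{
    assert(board > 0); /* guard against division by zero */
    return 100.0 * (double)requested / (double)board;
}
```

Calling `requested_mbps(bytes_per_cycle(2, 1, 4), 294)` gives 3528, and `bandwidth_efficiency_pct(3528, 25600)` gives roughly 13.8, which the text rounds to 14%.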

Solutions for low bandwidth:

  • Automatically or manually vectorize the kernel to make wider requests
  • Unroll the innermost loop to make more requests per clock cycle
  • Delegate some of the tasks to another kernel
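To get a rough feel for how much wider the requests need to be, the following C sketch scales the example's requested bandwidth by a vectorization or unroll factor. It rests on simplifying assumptions that are not from the guide: the requested bandwidth grows linearly with the factor, the kernel clock frequency stays at 294 MHz, and only power-of-two factors are considered. The function names are hypothetical. Treat the result as an estimate to confirm with the Profiler, not as a prediction.

```c
#include <assert.h>

/* Estimated requested bandwidth (MB/s) when each clock cycle issues
 * `width` times the baseline traffic. Assumes the clock frequency is
 * unchanged by the wider datapath, which may not hold in practice. */
unsigned long scaled_request_mbps(unsigned long base_mbps, unsigned width)
{
    return base_mbps * width;
}

/* Smallest power-of-two width (up to max_width) whose estimated request
 * meets or exceeds the board bandwidth. Returns 0 if no such width exists. */
unsigned min_saturating_width(unsigned long base_mbps,
                              unsigned long board_mbps,
                              unsigned max_width)
{
    assert(base_mbps > 0);
    /* `w != 0` stops the loop if the shift wraps around to zero. */
    for (unsigned w = 1; w != 0 && w <= max_width; w <<= 1) {
        if (scaled_request_mbps(base_mbps, w) >= board_mbps)
            return w;
    }
    return 0;
}
```

With the numbers from this example, a factor of 4 yields an estimated 14112 MB/s (about 55% of 25600 MB/s), and 8 is the first power of two whose estimate exceeds the board's bandwidth. Beyond that point the board, not the kernel, would limit throughput.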