Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 3/28/2022
Public


5.8.5. No Stalls, High Occupancy Percentage, and Low Bandwidth

The structure of a kernel design might prevent it from leveraging all the available bandwidth that the accelerator board can offer.
Remember: An ideal kernel pipeline condition has a stall percentage of 0%, an occupancy percentage of 100%, and a bandwidth that equals the board's available bandwidth.
Figure 74. Example OpenCL Kernel and Profiler Analysis

In this example, the accelerator board can provide a bandwidth of 25600 megabytes per second (MB/s). However, the vector_add kernel requests only (2 reads + 1 write) x 4 bytes = 12 bytes per clock cycle. At a kernel clock frequency of 294 MHz, that is 12 bytes/cycle x 294 MHz = 3528 MB/s, which is about 14% of the available bandwidth. To increase the bandwidth, increase the number of tasks performed in each clock cycle.
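The arithmetic above can be reproduced with a small host-side helper. This is only a sketch: the function names requested_bandwidth_mbps and bandwidth_utilization_pct are hypothetical and are not part of the SDK or the Profiler. Because one byte per cycle at 1 MHz equals 1 MB/s, multiplying bytes per cycle by the clock frequency in MHz gives MB/s directly.

```c
/* Requested global-memory bandwidth in MB/s:
   accesses per cycle x bytes per access x kernel clock in MHz.
   (Hypothetical helper, for illustration only.) */
double requested_bandwidth_mbps(int accesses_per_cycle,
                                int bytes_per_access,
                                double fmax_mhz)
{
    return (double)accesses_per_cycle * bytes_per_access * fmax_mhz;
}

/* Share of the board's available bandwidth that the kernel requests,
   as a percentage. (Hypothetical helper, for illustration only.) */
double bandwidth_utilization_pct(double requested_mbps,
                                 double available_mbps)
{
    return 100.0 * requested_mbps / available_mbps;
}
```

With the figures from the example, requested_bandwidth_mbps(3, 4, 294.0) yields 3528 MB/s, and bandwidth_utilization_pct(3528.0, 25600.0) yields roughly 13.8%, which the text rounds to 14%.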

Solutions for low bandwidth:

  • Automatically or manually vectorize the kernel to make wider requests
  • Unroll the innermost loop to make more requests per clock cycle
  • Delegate some of the tasks to another kernel
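As an illustration of the loop-unrolling option, the following sketch shows a single work-item variant of a vector addition kernel. It is not the kernel from Figure 74: the name vector_add_unrolled, the float element type, the argument n, and the unroll factor of 4 are all assumptions chosen for this example. The shim at the top is also an addition; it exists only so that the sketch can be compiled and checked with an ordinary C compiler.

```c
/* Host-compiler shim (added for illustration): lets this sketch also build
   as plain C. An OpenCL C compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical single work-item vector addition with a partially
   unrolled loop. */
__kernel void vector_add_unrolled(__global const float *restrict a,
                                  __global const float *restrict b,
                                  __global float *restrict c,
                                  int n)
{
    /* Ask the compiler to replicate the loop body 4 times, so that each
       iteration of the unrolled loop performs 8 reads and 4 writes
       instead of 2 reads and 1 write. */
    #pragma unroll 4
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

If the unrolled pipeline still reached 294 MHz (an assumption, since unrolling can change the achieved clock frequency), the request would grow to 4 x 12 bytes/cycle x 294 MHz = 14112 MB/s, about 55% of the 25600 MB/s available in the example. For an NDRange kernel, the comparable route to automatic vectorization is the num_simd_work_items attribute, used together with reqd_work_group_size.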