Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022

5.7.1. Stall, Occupancy, Bandwidth

For specific lines of kernel code, the Source View tab of the Intel® VTune Profiler GUI shows stall percentage, occupancy percentage, data transfer size, and average memory bandwidth.

For definitions of stall, occupancy, and bandwidth, refer to Types of Performance Data.

The Intel® FPGA SDK for OpenCL™ generates a pipeline architecture in which work-items traverse the pipeline stages sequentially (that is, in a pipeline-parallel manner). As soon as a pipeline stage becomes empty, the next work-item enters and occupies that stage. Pipeline parallelism also applies to pipelined loops, where iterations enter the loop sequentially.

Figure 72. Simplified Representation of a Kernel Pipeline Instrumented with Performance Counters

The following simplified equations describe how the Profiler calculates stall, occupancy, and bandwidth:

Note: ivalid_count in the bandwidth equation also includes the predicate=true input to the load-store unit.
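The equations themselves are not reproduced in this text. As a rough sketch of the kind of calculation involved, the following Python assumes these forms: stall is the fraction of cycles a location spends stalled, occupancy is the fraction of cycles with valid input, and bandwidth is the data moved per valid access divided by kernel time. Only ivalid_count is named in this section; ostall_count, total_count, data_width_bytes, and kernel_time_s are assumed names, and profile_metrics is a hypothetical helper, not part of the SDK or the Profiler.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical counter readings for one location in the kernel pipeline."""
    ivalid_count: int      # cycles with valid input (named in the note above)
    ostall_count: int      # assumed name: cycles in which the location stalled
    total_count: int       # assumed name: total cycles observed
    data_width_bytes: int  # assumed: bytes moved per valid access
    kernel_time_s: float   # assumed: kernel execution time in seconds


def profile_metrics(s: CounterSample) -> dict:
    """Compute stall %, occupancy %, and bandwidth under the assumed equations."""
    if s.total_count <= 0 or s.kernel_time_s <= 0:
        raise ValueError("total_count and kernel_time_s must be positive")
    return {
        "stall_pct": 100.0 * s.ostall_count / s.total_count,
        "occupancy_pct": 100.0 * s.ivalid_count / s.total_count,
        "bandwidth_bytes_per_s": s.data_width_bytes * s.ivalid_count / s.kernel_time_s,
    }


# Illustrative numbers only; they are not measurements from any board.
sample = CounterSample(ivalid_count=750, ostall_count=250, total_count=1000,
                       data_width_bytes=64, kernel_time_s=0.5)
metrics = profile_metrics(sample)
print(metrics)
```

With these made-up readings, the location is stalled for a quarter of the observed cycles and receives valid input for the remaining three quarters.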

Ideal kernel pipeline conditions:

  • Stall percentage equals 0%
  • Occupancy percentage equals 100%
  • Bandwidth equals the board's bandwidth

For a given location in the kernel pipeline, if the sum of the stall percentage and the occupancy percentage is approximately 100%, the Profiler identifies that location as the source of the stall. If the stall percentage is low, the Profiler identifies the location as a victim of the stall.
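The rule above can be sketched as a small classifier. The text does not say how close to 100% counts as "approximately" or how small a stall percentage counts as "low", so the two thresholds below are illustrative assumptions, as is the choice to check the low-stall case first. classify_location is a hypothetical helper, not a Profiler API.

```python
def classify_location(stall_pct: float, occupancy_pct: float,
                      sum_tolerance: float = 5.0,
                      low_stall_pct: float = 10.0) -> str:
    """Label a pipeline location as a stall source or a stall victim.

    sum_tolerance and low_stall_pct are assumed thresholds; the source
    text gives no numeric values for them.
    """
    # Assumption: a low stall percentage is checked first, so a location
    # that rarely stalls is never reported as the source.
    if stall_pct <= low_stall_pct:
        return "stall victim"
    # Stall plus occupancy close to 100% means the location is either
    # busy or stalled in nearly every cycle.
    if abs((stall_pct + occupancy_pct) - 100.0) <= sum_tolerance:
        return "stall source"
    return "undetermined"


print(classify_location(40.0, 58.0))
print(classify_location(3.0, 40.0))
print(classify_location(30.0, 40.0))
```

The "undetermined" result covers readings that match neither statement in the text; how the Profiler treats such locations is not specified here.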

The Profiler reports a high occupancy percentage if the offline compiler generates a highly efficient pipeline from your kernel, where work-items or iterations are moving through the pipeline stages without stalling.

If all LSUs are accessed the same number of times, they have the same occupancy value.

  • If work-items cannot enter the pipeline consecutively, they insert bubbles into the pipeline.
  • In loop pipelining, loop-carried dependencies also form bubbles in the pipeline, because an iteration cannot enter until the value it depends on is available, which leaves gaps between iterations.
  • If an LSU is accessed less frequently than other LSUs, such as the case when an LSU is outside a loop that contains other LSUs, this LSU has a lower occupancy value than the other LSUs.

The same rule regarding occupancy value applies to channels.
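To illustrate why an LSU outside a loop reports lower occupancy than one inside it, here is a minimal sketch. It assumes an idealized pipeline in which one inner-loop iteration issues per cycle with no stalls, so the total cycle count equals the number of inner iterations. The LSU names, loop trip counts, and the lsu_occupancy helper are all hypothetical.

```python
def lsu_occupancy(access_counts: dict, total_cycles: int) -> dict:
    """Occupancy % per LSU, taken as accesses divided by total observed cycles."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return {name: 100.0 * count / total_cycles
            for name, count in access_counts.items()}


outer_iters, inner_iters = 16, 64

# Idealized assumption: one inner iteration issues every cycle, no bubbles.
total_cycles = outer_iters * inner_iters

accesses = {
    "lsu_outside_loop": outer_iters,               # accessed once per outer iteration
    "lsu_inside_loop": outer_iters * inner_iters,  # accessed once per inner iteration
}

occupancy = lsu_occupancy(accesses, total_cycles)
for name, pct in occupancy.items():
    print(f"{name}: {pct:.2f}%")
```

Under these assumptions the LSU inside the loop is occupied every cycle, while the LSU outside the loop is occupied only once per 64 cycles, even though neither one is stalling.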
