Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022
Public

3.4. Loops in a Single Work-Item Kernel

The Intel® FPGA SDK for OpenCL™ Offline Compiler optimizes performance of single work-item kernels by pipelining loops.
The datapath of a loop within a single work-item kernel can contain multiple iterations in flight. This behavior differs from a loop within an NDRange kernel, whose datapath contains multiple work-items (rather than loop iterations) in flight. In an optimally pipelined loop, a new loop iteration is launched every clock cycle. Launching one loop iteration per clock cycle maximizes pipeline efficiency and yields the best performance. As shown in the figure below, launching one loop iteration per clock cycle allows a kernel to finish faster.
Figure 51. Comparison of the Launch Frequency of Loop Iterations Between a Non-Pipelined Loop and a Pipelined Loop

The number of clock cycles between the launch of one loop iteration and the next is called the loop's initiation interval (II). An optimally pipelined loop has an II value of 1 because a new loop iteration is launched every clock cycle.
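The effect of II on run time can be estimated with a simple model. The sketch below is an idealized illustration, not output from the offline compiler: it assumes a fixed per-iteration latency, no stalls, and a constant II. The function names pipelined_cycles and serial_cycles and their parameters are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Idealized cycle count for a pipelined loop (assumed model, no stalls):
 * the first iteration takes `latency` cycles to complete, and every later
 * iteration launches `ii` cycles after the one before it. */
uint64_t pipelined_cycles(uint64_t iterations, uint64_t latency, uint64_t ii)
{
    assert(latency >= 1 && ii >= 1);
    if (iterations == 0) {
        return 0;
    }
    return latency + (iterations - 1) * ii;
}

/* Idealized cycle count for a loop that is not pipelined: each iteration
 * must finish before the next one begins. */
uint64_t serial_cycles(uint64_t iterations, uint64_t latency)
{
    assert(latency >= 1);
    return iterations * latency;
}
```

Under these assumptions, 1000 iterations with a latency of 10 cycles take 1009 cycles at II=1 and 2008 cycles at II=2, compared with 10000 cycles when the loop is not pipelined. When the II equals the latency, the pipelined estimate matches the non-pipelined one.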

The Intel® FPGA SDK for OpenCL™ Offline Compiler may not pipeline every loop in the kernel. If a loop is not pipelined, a loop iteration cannot begin until the previous iteration finishes executing. In this case, only one loop iteration is active in the loop's datapath at a time. View the HTML report to find out which loops are pipelined and, for pipelined loops, what their II values are.

Consider the following example:

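kernel void simple_loop (unsigned N,
                         global unsigned* restrict b,
                         global unsigned* restrict c,
                         global unsigned* restrict out)
{
    // Each iteration reads c[i-1], which the previous iteration wrote.
    for (unsigned i = 1; i < N; i++) {
        c[i] = c[i-1] + b[i];
    }
    // Assumes N >= 1; for N == 0 the unsigned index N-1 wraps around.
    out[0] = c[N-1];
}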
Figure 52. Hardware Datapath of the Kernel simple_loop

The figure depicts how the offline compiler uses loop pipelining to execute simple_loop efficiently. The figure shows that the loop's datapath contains three loop iterations at the same time. Therefore, this loop is pipelined. The figure also shows that a new loop iteration enters the datapath every clock cycle. Therefore, the loop has II=1.
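The occupancy described above can be reproduced with a small model. The sketch below is an illustration under stated assumptions, not a description of the hardware the offline compiler generates: it assumes that each iteration occupies the datapath for a fixed number of cycles (depth), that iteration k launches at cycle k * ii, and that nothing stalls. The function name iterations_in_flight and its parameters are hypothetical. The depth of 3 used in the note that follows is taken from the three iterations shown in the figure; the actual datapath depth is determined by the compiler.

```c
#include <assert.h>

/* Counts the loop iterations occupying the datapath at a given cycle.
 * Assumed model: iteration k (0-based) launches at cycle k * ii and stays
 * in the datapath for `depth` consecutive cycles. */
unsigned iterations_in_flight(unsigned cycle, unsigned iterations,
                              unsigned depth, unsigned ii)
{
    assert(depth >= 1 && ii >= 1);
    unsigned count = 0;
    for (unsigned k = 0; k < iterations; k++) {
        unsigned launch = k * ii;
        if (cycle >= launch && cycle < launch + depth) {
            count++;
        }
    }
    return count;
}
```

With 10 iterations, a depth of 3, and II=1, the model reports one iteration in flight at cycle 0, two at cycle 1, and three from cycle 2 through cycle 9, after which the datapath drains. Setting ii equal to depth models a loop that is not pipelined: at most one iteration is in flight at any cycle.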