Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 10/04/2021
Public

10.2.1. Simplifying Loop-Carried Dependencies in Intel® Stratix® 10 OpenCL Designs

For your Intel® Stratix® 10 OpenCL design to achieve optimal performance, keep loop-carried computation as simple as possible so that the Intel® FPGA SDK for OpenCL™ Offline Compiler can compute it in one clock cycle.

The offline compiler cannot pipeline computations used for loop-carried dependencies. Loops that contain many complex computations limit the retiming optimizations that the compiler can perform because the compiler cannot make any functional changes to the loop path. Even if II=1, the HTML report identifies the fMAX bottleneck. Use this information together with the Loop Analysis report pane to assess the most critical paths in your design.

If a loop-carried dependency contains logic that the offline compiler cannot compute in one clock cycle, one mitigation approach is to lengthen the dependency distance. The dependency distance is the number of loop iterations between the point where a value is read and the point where the next value becomes available. The Loop Analysis report within the High Level Design Report identifies the most complex loop dependency.

Automated Loop-Carried Dependency Optimization

For Intel® Stratix® 10 designs, the Intel® FPGA SDK for OpenCL™ Offline Compiler attempts to automate the incrementation of a value modulo N (mod N) on every iteration of a loop.

You can apply this optimization manually for any operation that is associative and commutative. If you refactor the code this way, the compiler can spread the computation across two or more loop-carried variables, and it can recombine the computation when the value is needed in a non-loop-carried computation. For more information, refer to Safari, Nima et al. "Methods for Implementation of Feedback Loops in High Speed FPGA Applications". 24th International Conference on Field Programmable Logic and Applications (FPL) (2014) doi:10.1109/FPL.2014.6927434.
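As a rough illustration of the manual refactoring described above, the following plain C sketch computes a running sum two ways. The function names running_sum_single and running_sum_split are hypothetical and do not come from the SDK. The second version rotates two partial sums so that each value produced is not consumed until two iterations later (a dependency distance of 2), and it recombines the partial sums only for the output, which is not loop-carried. Unsigned integer addition is used because it is exactly associative and commutative; floating-point addition is not, so the two versions could differ in rounding for floating-point data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: a single loop-carried variable, dependency distance 1. */
void running_sum_single(const uint32_t *in, uint32_t *out, size_t n) {
    uint32_t sum = 0;
    for (size_t k = 0; k < n; k++) {
        sum += in[k];          /* next iteration needs this result */
        out[k] = sum;
    }
}

/* Sketch: two loop-carried partial sums rotated like a 2-deep shift
 * register. The value written to `newer` is read again only after it
 * has moved to `older`, that is, two iterations later. */
void running_sum_split(const uint32_t *in, uint32_t *out, size_t n) {
    uint32_t older = 0;
    uint32_t newer = 0;
    for (size_t k = 0; k < n; k++) {
        uint32_t t = older + in[k];  /* uses the value from 2 iterations ago */
        older = newer;               /* rotate the partial sums */
        newer = t;
        out[k] = older + newer;      /* recombine off the loop-carried path */
    }
}
```

Both functions are expected to produce identical output for any input, because the split version only regroups the same additions. Whether the offline compiler maps such a loop to the hardware structure you intend depends on the surrounding kernel, so check the Loop Analysis report after refactoring.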

Intel® recommends this optimization for operations that are on your design's critical path. Consider the following example:

int i = 0;
int N = 256;
while (!done) {
   i++;
   if (i == N) i = 0;
   // use i for some computation…
}

On each loop iteration, the offline compiler must increment a value, compare it to a constant, and then reset the value if necessary. To optimize this code, the compiler effectively breaks down the expression and spreads the computation across two clock cycles to increase the dependence distance. The side effect of this optimization is a small increase in logic usage.

There are scenarios in which the offline compiler might not optimize the example code:

  • If the initial value of i is non-zero, and the compiler cannot determine that the initial value is between 0 and N, the compiler might not be able to guarantee that the original and optimized forms are functionally equivalent.
  • If any other condition in the loop causes i to be modified or reset to 0, the offline compiler does not apply the optimization.
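The modulo-N counter example can also be restructured by hand. The plain C sketch below is one possible form and is not necessarily the transformation that the offline compiler applies internally. The names counter_reference, counter_lookahead, and at_last are hypothetical. The idea is to carry a one-bit flag that records whether the counter already sits at N - 1, so the comparison runs on the current value in parallel with the increment, and its result is consumed one iteration later. The sketch assumes N >= 2 and an initial value of 0, which matches the constraints listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form from the text: increment, compare, reset in series. */
void counter_reference(int n, size_t steps, int *out) {
    int i = 0;
    for (size_t k = 0; k < steps; k++) {
        i++;
        if (i == n) i = 0;
        out[k] = i;            /* stands in for "use i" */
    }
}

/* Sketch: look-ahead flag. Invariant: at_last == (i == n - 1). */
void counter_lookahead(int n, size_t steps, int *out) {
    assert(n >= 2);            /* assumption: the flag logic needs n >= 2 */
    int i = 0;
    int at_last = (i == n - 1);
    for (size_t k = 0; k < steps; k++) {
        /* Select uses only the one-bit flag, not a fresh comparison. */
        int next_i = at_last ? 0 : i + 1;
        /* Compare the current i against n - 2 to predict the next flag. */
        int next_at_last = (i == n - 2);
        i = next_i;
        at_last = next_at_last;
        out[k] = i;            /* stands in for "use i" */
    }
}
```

The extra flag and its comparator correspond to the small increase in logic usage mentioned above. If the initial value could fall outside the range 0 to N - 1, the invariant on at_last would need to be re-established before the loop.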