Intel® FPGA SDK for OpenCL™ Pro Edition: Programming Guide

ID 683846
Date 12/13/2021

5.2.9. Loop Interleaving Control (max_interleaving Pragma)

The Intel® FPGA SDK for OpenCL™ Offline Compiler attempts to maximize the throughput and hardware resource occupancy of pipelined inner loops in a loop nest by issuing new inner loop iterations as frequently as possible, that is, by minimizing the loop initiation interval (II). When the compiler cannot achieve an II of 1 for an inner loop, it configures the loop nest to interleave iterations of one invocation of the inner loop with iterations of other invocations of the same inner loop.

As an example, consider the loop nest in the following code snippet:

// Loop j is pipelined with ii=1
for (int j = 0; j < M; j++) {
  int a[N];
  // Loop i is pipelined with ii=2
  for (int i = 1; i < N; i++) {
    a[i] = foo(i);
  }
}

In this example, the inner i loop is pipelined with a loop II of 2. Under normal pipelining, the inner loop hardware achieves only 50% utilization because one i iteration is initiated every other cycle. To take advantage of these idle cycles, the compiler interleaves a second invocation of the i loop from the next iteration of the outer j loop. Here, a loop invocation is one start of pipelined execution of a loop body. Because the i loop resides inside the j loop, and the j loop has a trip count of M, the i loop is invoked M times. Because the j loop is an outermost loop, it is invoked once. The following table illustrates the difference between normal pipelined execution of the i loop and interleaved execution for this example where N=5:

Table 1.  Difference Between Normal Pipelined Execution and Interleaved Execution
Cycle   Pipelined   Interleaved
0       (0,0)       (0,0)
1       ---         (1,0)
2       (0,1)       (0,1)
3       ---         (1,1)
4       (0,2)       (0,2)
5       ---         (1,2)
6       (0,3)       (0,3)
7       ---         (1,3)
8       (0,4)       (0,4)
9       ---         (1,4)
10      (1,0)       (2,0)
11      ---         (3,0)
12      (1,1)       (2,1)
13      ---         (3,1)
14      (1,2)       (2,2)
15      ---         (3,2)
16      (1,3)       (2,3)
17      ---         (3,3)
18      (1,4)       (2,4)
19      ---         (3,4)

The table shows the values (j,i) for each inner loop iteration that is initiated at each cycle. At cycle 0, both modes of execution initiate the (0,0)th iteration of the i loop. Under normal pipelined execution, no i loop iteration is initiated at cycle 1. Under interleaved execution, the (1,0)th iteration of the innermost loop, that is, the first iteration of the next (j=1) invocation of the i loop, is initiated at cycle 1. By cycle 10, interleaved execution has initiated all of the iterations of both the j=0 invocation and the j=1 invocation of the i loop, whereas normal pipelined execution has initiated only the iterations of the j=0 invocation. Interleaved execution is therefore twice as efficient as normal pipelined execution in this example.
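The schedule in Table 1 can be reproduced with a small host-side model. The following C sketch is illustrative only and is not part of the SDK: the Slot type and the functions pipelined_slot, interleaved_slot, and print_schedule are hypothetical names. The model assumes that each invocation has n_iters inner iterations, that the number of interleaved invocations equals the II, and it ignores pipeline latency and the outer trip count M.

```c
#include <assert.h>
#include <stdio.h>

/* One inner-loop iteration, identified by outer index j and inner index i. */
typedef struct {
    int j;
    int i;
    int valid; /* 0 when no iteration is initiated on this cycle */
} Slot;

/* Normal pipelining: one iteration is initiated every `ii` cycles,
 * and invocations of the inner loop run back to back. */
Slot pipelined_slot(int cycle, int n_iters, int ii) {
    Slot s = {0, 0, 0};
    assert(cycle >= 0 && n_iters > 0 && ii > 0);
    if (cycle % ii != 0) {
        return s; /* idle cycle */
    }
    int k = cycle / ii; /* number of iterations initiated before this one */
    s.j = k / n_iters;
    s.i = k % n_iters;
    s.valid = 1;
    return s;
}

/* Interleaved execution (assumed model): `ii` invocations share the
 * pipeline, and each one initiates an iteration once every `ii` cycles. */
Slot interleaved_slot(int cycle, int n_iters, int ii) {
    Slot s = {0, 0, 0};
    assert(cycle >= 0 && n_iters > 0 && ii > 0);
    int group_len = n_iters * ii;    /* cycles used by one group of ii invocations */
    int group = cycle / group_len;   /* which group of invocations is active */
    int offset = cycle % group_len;  /* position inside that group */
    s.j = group * ii + offset % ii;
    s.i = offset / ii;
    s.valid = 1;
    return s;
}

/* Print one row per cycle: cycle, pipelined (j,i) or ---, interleaved (j,i). */
void print_schedule(int cycles, int n_iters, int ii) {
    for (int c = 0; c < cycles; c++) {
        Slot p = pipelined_slot(c, n_iters, ii);
        Slot q = interleaved_slot(c, n_iters, ii);
        if (p.valid) {
            printf("%d (%d,%d) (%d,%d)\n", c, p.j, p.i, q.j, q.i);
        } else {
            printf("%d --- (%d,%d)\n", c, q.j, q.i);
        }
    }
}
```

Calling print_schedule(20, 5, 2) from a main function lists the same (j,i) pairs as Table 1 for cycles 0 through 19.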

In some cases, you may decide that the performance benefit from interleaving does not justify the area cost associated with enabling interleaving. In these cases, you may want to limit or restrict the amount of interleaving to reduce FPGA area utilization. To limit the number of interleaved invocations of an inner loop that can be executed simultaneously, annotate the inner loop with the max_interleaving pragma. The annotated loop must be contained inside another pipelined loop. The required parameter (n) specifies an upper bound on the degree of interleaving allowed, that is, how many invocations of the containing loop can execute the annotated loop at a given time.

Specify the max_interleaving pragma in one of the following ways:

  • #pragma max_interleaving 1

    The compiler restricts the annotated (inner) loop to be invoked only once per outer loop iteration. That is, all iterations of the inner loop travel through the pipeline before the next invocation of the inner loop can occur.

  • #pragma max_interleaving 0

    The compiler allows the pipeline to contain a number of simultaneous invocations of the inner loop equal to the loop initiation interval (II) of the inner loop. For example, an inner loop with an II of 2 can have iterations from two invocations in the pipeline at a time. This behavior is the default behavior for the compiler if you do not specify the max_interleaving pragma.
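The effect of the two settings on throughput can be estimated with a rough model. The C sketch below is an assumption-laden illustration, not compiler behavior taken from the SDK: effective_interleaving and initiation_cycles are hypothetical helper names, clamping values larger than the II down to the II is an assumption, and the cycle count ignores pipeline latency.

```c
#include <assert.h>

/* Number of inner-loop invocations that share the pipeline in this model.
 * max_interleaving == 0 selects the default, which equals the loop II.
 * Clamping larger values to ii is an assumption of this sketch. */
int effective_interleaving(int max_interleaving, int ii) {
    assert(max_interleaving >= 0 && ii >= 1);
    if (max_interleaving == 0 || max_interleaving > ii) {
        return ii;
    }
    return max_interleaving;
}

/* Cycles spent initiating all iterations of m invocations of an inner loop
 * with n_iters iterations each, ignoring pipeline latency. */
long initiation_cycles(int m, int n_iters, int ii, int max_interleaving) {
    assert(m >= 0 && n_iters >= 0);
    int d = effective_interleaving(max_interleaving, ii);
    long groups = (m + d - 1) / d; /* ceil(m / d) groups of interleaved invocations */
    return groups * (long)n_iters * ii;
}
```

Under these assumptions, four invocations of a five-iteration inner loop with an II of 2 take 20 initiation cycles with max_interleaving 0 and 40 initiation cycles with max_interleaving 1, which matches the ratio shown in Table 1.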

In the following code snippet, the compiler restricts the pipelined execution of the i loop. A new invocation of the i loop corresponds only to the subsequent iteration of the j loop.
// Loop j is pipelined with ii=1
for (int j = 0; j < M; j++) {
  int a[N];
  // Loop i is pipelined with ii=2
  #pragma max_interleaving 1
  for (int i = 1; i < N; i++) {
    a[i] = foo(i);
  }
}