unroll Pragma

Loop unrolling involves replicating a loop body multiple times and reducing the trip count of a loop. Unroll loops to reduce or eliminate loop control overhead on the FPGA. In cases where there are no loop-carried dependencies and the Intel® oneAPI DPC++/C++ Compiler can perform loop iterations in parallel, unrolling loops can also reduce latency and overhead.
Unrolling nested loops with large bounds might generate a huge number of instructions, which can lead to very long compile times.
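To give a sense of scale, the sketch below counts how many copies of the innermost body a full unroll of a two-level loop nest produces, which is the product of the two trip counts. The helper name and the 64 x 64 bounds are illustrative assumptions, not values taken from the compiler documentation.

```cpp
#include <cstddef>

// Hypothetical helper: number of copies of the innermost loop body that
// result when both loops of a two-level nest are fully unrolled.
constexpr std::size_t FullyUnrolledBodyCopies(std::size_t outer_trips,
                                              std::size_t inner_trips) {
  return outer_trips * inner_trips;
}

// A modest 64 x 64 nest already replicates the body 4096 times.
static_assert(FullyUnrolledBodyCopies(64, 64) == 4096, "64 * 64 copies");
```

Partially unrolling only the inner loop, or choosing a small unroll factor, keeps this count manageable.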
The compiler might unroll simple loops even if a pragma does not annotate them. To direct the compiler to unroll a loop, or to explicitly not unroll a loop, insert an unroll kernel pragma in the kernel code preceding the loop. To specify an unroll factor N, use the optional unroll factor specifier: #pragma unroll <N>. For guidance on choosing a factor, see the Determining the Correct Unroll Factor section in the Unrolling Loops FPGA tutorial.
Syntax
```
#pragma unroll

#pragma unroll N
```
If you specify the unroll factor N, the factor must be a positive constant expression of integer type. If you omit the unroll factor N, the loop is unrolled fully.
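Because the factor only needs to be a positive integer constant expression, it does not have to be a literal. The sketch below passes it as a template parameter. The function name IncrementAll and its parameters are hypothetical, and the code is plain host C++ rather than a complete kernel; compilers that do not recognize the pragma ignore it, and the result is the same either way.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: increments every element of the array. The unroll
// factor is a compile-time constant supplied through a template parameter.
template <int kUnrollFactor, std::size_t kSize>
void IncrementAll(std::array<int, kSize>& a) {
  static_assert(kUnrollFactor > 0, "the unroll factor must be positive");
#pragma unroll kUnrollFactor
  for (std::size_t i = 0; i < kSize; i++) {
    a[i] += 1;
  }
}
```

A call such as IncrementAll<4>(a) requests an unroll factor of 4, which makes it straightforward to compare several factors from the same source.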
Examples
The following is an example of full loop unrolling:
```
// Before unrolling loop
#pragma unroll
for (int i = 0; i < 5; i++) {
  a[i] += 1;
}
```
```
// After fully unrolling the loop by a factor of 5,
// the loop is flattened. There is no loop after unrolling.
a[0] += 1;
a[1] += 1;
a[2] += 1;
a[3] += 1;
a[4] += 1;
```
A full unroll is the special case in which the unroll factor equals the number of loop iterations.
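The equivalence between the annotated loop and its flattened form can be checked on the host. The following is a minimal sketch in plain C++, with hypothetical function names; it shows only that the two forms compute the same values, not how the compiler builds the hardware.

```cpp
#include <array>

// Loop form, annotated for full unrolling. A compiler that does not
// recognize the pragma ignores it and runs the loop as written.
inline void IncrementLoop(std::array<int, 5>& a) {
#pragma unroll
  for (int i = 0; i < 5; i++) {
    a[i] += 1;
  }
}

// Hand-flattened form: five copies of the body and no loop control.
inline void IncrementFlattened(std::array<int, 5>& a) {
  a[0] += 1;
  a[1] += 1;
  a[2] += 1;
  a[3] += 1;
  a[4] += 1;
}
```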
The following is an example of partial loop unrolling:
```
// Before unrolling loop
#pragma unroll 4
for (int i = 0; i < 20; i++) {
  a[i] += 1;
}
```
```
// After the loop is unrolled by a factor of 4,
// the loop has five (20 / 4) iterations.
for (int i = 0; i < 5; i++) {
  a[i * 4] += 1;
  a[i * 4 + 1] += 1;
  a[i * 4 + 2] += 1;
  a[i * 4 + 3] += 1;
}
```
In the partial unroll example, each iteration of the unrolled loop is equivalent to four iterations of the original loop. The Intel® oneAPI DPC++/C++ Compiler instantiates four adders instead of one. Because this loop has no data dependency between iterations, the compiler executes the four additions in parallel.
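The example's trip count of 20 divides evenly by 4. When it does not, a hand-unrolled equivalent needs a remainder loop for the leftover iterations. The sketch below illustrates this in plain host C++; the function names are hypothetical, and it models only the arithmetic, not the code or hardware the compiler actually generates.

```cpp
#include <cstddef>
#include <vector>

// Rolled form with a requested unroll factor of 4.
inline void IncrementRolled(std::vector<int>& a) {
#pragma unroll 4
  for (std::size_t i = 0; i < a.size(); i++) {
    a[i] += 1;
  }
}

// Hand-unrolled form: groups of four, then a remainder loop for the
// iterations left over when the size is not a multiple of 4.
inline void IncrementUnrolledBy4(std::vector<int>& a) {
  const std::size_t n = a.size();
  const std::size_t full_groups = n / 4;
  for (std::size_t i = 0; i < full_groups; i++) {
    a[i * 4] += 1;
    a[i * 4 + 1] += 1;
    a[i * 4 + 2] += 1;
    a[i * 4 + 3] += 1;
  }
  for (std::size_t i = full_groups * 4; i < n; i++) {
    a[i] += 1;
  }
}
```

With 22 elements, the main loop covers indices 0 through 19 in five iterations and the remainder loop handles indices 20 and 21.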
For additional information, refer to the FPGA tutorial sample "Loop Unroll" listed in the Intel® oneAPI Samples Browser on Linux* or Windows*, or access the code sample on GitHub.
Notes
• Provide an unroll factor whenever possible. To specify an unroll factor N, insert the #pragma unroll <N> directive before a loop in your kernel code. The Intel® oneAPI DPC++/C++ Compiler attempts to unroll the loop at most <N> times. Consider the following code fragment. Assigning 2 as the unroll factor directs the compiler to unroll the loop by a factor of 2, so each iteration of the unrolled loop performs two iterations of the original loop body.
```
#pragma unroll 2
for (size_t k = 0; k < 4; k++) {
  mac += data_in[(gid * 4) + k] * coeff[k];
}
```
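For context, the fragment can be placed in a self-contained function. This is a sketch under stated assumptions: the function name is hypothetical, the element type is assumed to be float, and gid is passed as an ordinary parameter standing in for the work-item index that a real kernel would obtain from the runtime.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side wrapper around the multiply-accumulate loop.
// gid selects a group of four consecutive input samples.
inline float MacForWorkItem(const std::vector<float>& data_in,
                            const std::array<float, 4>& coeff,
                            std::size_t gid) {
  float mac = 0.0f;
#pragma unroll 2
  for (std::size_t k = 0; k < 4; k++) {
    mac += data_in[(gid * 4) + k] * coeff[k];
  }
  return mac;
}
```

With an unroll factor of 2, the four-iteration loop corresponds to two iterations that each perform two multiply-accumulate steps.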