Unroll Loops
You can control the way the
Intel® oneAPI
translates SYCL kernel descriptions to hardware resources by unrolling loops. Loop unrolling decreases the number of iterations that the
DPC++/C++
CompilerIntel® oneAPI
executes at the expense of increased hardware resource consumption and increases performance. See also
Unroll LoopsDPC++/C++
CompilerConsider the SYCL code for a parallel application in which each work-item is responsible for computing the accumulation of four elements in an array:
queue.submit([&](handler &cgh) {
accessor x(x_buf, cgh, read_only);
accessor sum(sum_buf, cgh, write_only);
cgh.single_task<class unoptimzed>([=]() {
int accum = 0;
for (size_t i = 0; i < 4; i++) {
accum += x[i + get_global_id(0) * 4];
}
sum[get_global_id(0)] = accum;
});
});
Observe the following three main operations that occur in this kernel:
- Load operations from inputx
- Accumulation
- Store operations to output sum
The
Intel® oneAPI
arranges these operations in a pipeline according to the data flow semantics of the SYCL* kernel code. For example, the
DPC++/C++
CompilerIntel® oneAPI
implements loops by forwarding the results from the end of the pipeline to the top of the pipeline, depending on the loop exit condition.
DPC++/C++
CompilerThe SYCL kernel performs one loop iteration of each work-item per clock cycle. With sufficient hardware resources, you can increase kernel performance by unrolling the loop, which decreases the number of iterations that the kernel executes. To unroll a loop, add a
#pragma unroll
directive to the main loop, as shown in the following code example:
Loop unrolling significantly changes the structure of the compute unit that the
Intel® oneAPI
creates.
DPC++/C++
Compilerqueue.submit([&](handler &cgh) {
accessor x(x_buf, cgh, read_only);
accessor sum(sum_buf, cgh, write_only);
cgh.single_task<class unoptimzed>([=]() {
int accum = 0;
#pragma unroll
for (size_t i = 0; i < 4; i++) {
accum += x[i + get_global_id(0) * 4];
}
sum[get_global_id(0)] = accum;
});
});
In this example, the
#pragma unroll
directive causes the
Intel® oneAPI
to unroll four iterations of the loop completely. To accomplish the unrolling, the
DPC++/C++
CompilerIntel® oneAPI
expands the pipeline by tripling the number of addition operations and loading four times more data. With the removal of the loop, the compute unit assumes a feed-forward structure. As a result, the compute unit can store the DPC++/C++
Compilersum
elements in every clock cycle after the completion of the initial load operations and additions. The
Intel® oneAPI
further optimizes this kernel by coalescing the four load operations so that the compute unit can load all necessary input data to calculate a result in one load operation.
DPC++/C++
CompilerFactors to Consider for Loop Unrolling
- Avoid nested looping structures. Instead, implement a large single loop or unroll inner loops by adding the#pragma unrolldirective whenever possible. For example, if you compile a kernel that has a heavily nested loop structure, wherein each loop includes a#pragma unrolldirective, you might experience a long compilation time. TheIntel® oneAPImight fail to meet scheduling because it cannot unroll this nested loop structure easily, resulting in a high II. In this case, theDPC++/C++CompilerIntel® oneAPIissues the following error message along with the line number of the outermost loop:DPC++/C++CompilerKernel <function> exceeded the Max II. The Kernel's resource usage is estimated to be much larger than FPGA capacity. It will perform poorly even if it fits. Reduce resource utilization of the kernel by reducing loop unroll factors within it (if any) or otherwise reduce amount of computation within the kernel.
- Unrolling the loop and coalescing load operations from global memory allow the hardware implementation of the kernel to perform more operations per clock cycle.
- TheIntel® oneAPImight not be able to unroll a loop completely under the following circumstances:DPC++/C++Compiler
- You specify complete unrolling of a data-dependent loop with a very large number of iterations. Consequently, the hardware implementation of your kernel might not fit into the FPGA.
- You specify complete unrolling and the loop bounds are not constants.
- The loop consists of complex control flows (for example, a loop containing complex array indexes or exit conditions that are unknown at compilation time).
For the last two cases listed above, theIntel® oneAPIissues the following warning:DPC++/C++CompilerFull unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled.To enable loop unrolling in these situations, specify the#pragma unroll <N>directive, where<N>is the unroll factor. The unroll factor limits the number of iterations that theIntel® oneAPIunrolls. Refer to Single Work-item Kernel Design Guidelines for tips on constructing well-structured loops.DPC++/C++Compiler