5. Loop Best Practices
The Intel® HLS Compiler Pro Edition reports any dependencies that prevent it from optimizing your loops. Eliminate these dependencies in your code where you can for optimal component performance. You can also provide additional guidance to the compiler by using the available loop pragmas.
- Manually fuse adjacent loop bodies when the instructions in those loop bodies can be performed in parallel. These fused loops can be pipelined instead of being executed sequentially. Pipelining reduces the latency of your component and can reduce the FPGA area your component uses.
- Use the #pragma loop_coalesce directive to have the compiler attempt to collapse nested loops. Coalescing loops reduces the latency of your component and can reduce the FPGA area overhead needed for nested loops.
- If you have two loops that can execute in parallel, consider using a system of tasks. For details, see System of Tasks Best Practices.
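The first two recommendations above can be sketched in plain C++ as follows. This is a minimal illustration, not code from the compiler's tutorials: the function names, array sizes, and arithmetic are hypothetical, and the functions are ordinary C++ rather than HLS components. The `#pragma loop_coalesce` directive is placed directly before the outer loop and is guarded by the `__INTELFPGA_COMPILER__` macro, on the assumption that the HLS compiler defines it, so that a standard C++ compiler does not warn about an unknown pragma.

```cpp
#include <cstddef>

constexpr std::size_t kN = 8;     // hypothetical array length
constexpr std::size_t kRows = 4;  // hypothetical matrix shape
constexpr std::size_t kCols = 4;

// Unfused: two adjacent loops over the same range execute one after the other.
void scale_then_offset(const int (&in)[kN], int (&scaled)[kN], int (&shifted)[kN]) {
  for (std::size_t i = 0; i < kN; ++i) {
    scaled[i] = in[i] * 3;
  }
  for (std::size_t i = 0; i < kN; ++i) {
    shifted[i] = in[i] + 7;
  }
}

// Fused by hand: the two bodies do not depend on each other,
// so they can share a single loop that the compiler can pipeline.
void scale_and_offset_fused(const int (&in)[kN], int (&scaled)[kN], int (&shifted)[kN]) {
  for (std::size_t i = 0; i < kN; ++i) {
    scaled[i] = in[i] * 3;
    shifted[i] = in[i] + 7;
  }
}

// Nested loops that the compiler is asked to try to collapse into one loop.
int sum_matrix(const int (&m)[kRows][kCols]) {
  int total = 0;
#ifdef __INTELFPGA_COMPILER__  // assumed compiler-defined macro
#pragma loop_coalesce
#endif
  for (std::size_t r = 0; r < kRows; ++r) {
    for (std::size_t c = 0; c < kCols; ++c) {
      total += m[r][c];
    }
  }
  return total;
}
```

The fused and unfused versions compute the same results; only the loop structure presented to the compiler differs. Fusing by hand is only valid when neither body reads a value that the other body writes in a later iteration.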
Tutorials Demonstrating Loop Best Practices
The Intel® HLS Compiler Pro Edition comes with a number of tutorials that illustrate important Intel® HLS Compiler concepts and demonstrate good coding practices.
You can find these tutorials in the following location on your Intel® Quartus® Prime system:
|best_practices/divergent_loops||Demonstrates a source-level optimization for designs with divergent loops.|
|best_practices/loop_coalesce||Demonstrates the performance and resource utilization improvements of using the loop_coalesce pragma on nested loops.|
|best_practices/loop_fusion||Demonstrates the latency and resource utilization improvements of loop fusion.|
|best_practices/loop_memory_dependency||Demonstrates breaking loop-carried dependencies using the ivdep pragma.|
|||Demonstrates a method to reduce the area utilization of a loop that meets the following conditions:|
|best_practices/optimize_ii_using_hls_register||Demonstrates how to use the hls_register attribute to reduce loop II and how to use hls_max_concurrency to improve component throughput.|
|best_practices/parallelize_array_operation||Demonstrates how to improve fMAX by correcting a bottleneck that arises when performing operations on an array in a loop.|
|||Demonstrates a method to reduce the II of a loop that includes a floating-point accumulator, or other reduction operation that cannot be computed at high speed in a single clock cycle.|
|best_practices/remove_loop_carried_dependency||Demonstrates how to improve loop performance by removing accesses to the same variable across nested loops.|
|best_practices/resource_sharing_filter||Demonstrates the following versions of a 32-tap finite impulse response (FIR) filter design:|
|best_practices/speculated_iterations||Demonstrates how to use #pragma speculated_iterations to control when speculated iterations are used.|
|best_practices/triangular_loop||Demonstrates a method for describing triangular loop patterns with dependencies.|
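As an illustration of the floating-point accumulator case listed in the table, one common way to loosen a reduction dependency is to spread the running sum across several independent partial accumulators, so that each accumulator is updated only once every few iterations. The sketch below shows that general idea in plain C++. It is not taken from the tutorial, and the function name and lane count `kLanes` are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kLanes = 4;  // hypothetical number of partial accumulators

// Sums n floats using kLanes interleaved partial sums.
// Iteration i updates partial[i % kLanes], so consecutive iterations
// never read and write the same accumulator.
float sum_interleaved(const float* data, std::size_t n) {
  float partial[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};
  for (std::size_t i = 0; i < n; ++i) {
    partial[i % kLanes] += data[i];
  }
  // Combine the partial sums in a short final loop.
  float total = 0.0f;
  for (std::size_t k = 0; k < kLanes; ++k) {
    total += partial[k];
  }
  return total;
}
```

Because floating-point addition is not associative, this reordering can change the rounding of the result compared with a single running sum, so check that the difference is acceptable for your design.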