Intel® High Level Synthesis Compiler Pro Edition: Best Practices Guide

ID 683152
Date 10/04/2021


5.2.3. Example: Loop Pipelining and Unrolling

Consider a design that computes the dot product of every column of a matrix with every other column of the same matrix, and stores the six results (for the four-column matrix in this example) in the upper triangle of a second matrix. All other elements of the result matrix should be set to zero.

The code might look like the following example:
1.	#define ROWS 4
2.	#define COLS 4
4.	component void dut(...) {
5.		float a_matrix[COLS][ROWS]; // store in column-major format
6.		float r_matrix[ROWS][COLS]; // store in row-major format
8.		// setup...
10.		for (int i = 0; i < COLS; i++) {
11.			for (int j = i + 1; j < COLS; j++) {
13.				float dotProduct = 0;
14.				for (int mRow = 0; mRow < ROWS; mRow++) {
15.					dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
16.				}
17.				r_matrix[i][j] = dotProduct;
18.			}
19.		}
21.	 // continue...
23.	}

You can improve the performance of this component by unrolling the loops that iterate across each entry of a particular column. If the loop iterations are independent, the compiler can schedule their operations to run in parallel.

Floating-point operations typically must be carried out in the same order that they are expressed in your source code to preserve numerical precision. However, you can use the -ffp-contract=fast compiler flag to relax the ordering of floating-point operations. With the order of floating-point operations relaxed, the following conditions occur in this loop:
  • The multiplication operations can occur in parallel.
  • The addition operations can be composed into an adder tree instead of an adder chain.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops
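The difference between an adder chain and an adder tree can be sketched in plain C++, outside of any HLS flow. The function names and the input values below are hypothetical and chosen only so that single-precision rounding makes the two summation orders disagree. The sketch illustrates why reordering is not applied by default; it does not show what the compiler generates.

```cpp
#include <cassert>

// Adder chain: (((0 + p0) + p1) + p2) + p3, the order the source loop expresses.
float sum_chain(const float p[4]) {
    float sum = 0.0f;
    for (int k = 0; k < 4; ++k) {
        sum += p[k];
    }
    return sum;
}

// Adder tree: (p0 + p1) + (p2 + p3), one possible reassociated order.
float sum_tree(const float p[4]) {
    const float left = p[0] + p[1];
    const float right = p[2] + p[3];
    return left + right;
}

// Hypothetical products, assuming IEEE 754 single-precision arithmetic.
void reassociation_demo() {
    const float products[4] = {1.0e8f, 1.0f, -1.0e8f, 1.0f};
    assert(sum_chain(products) == 1.0f); // first 1.0f is absorbed, last one survives
    assert(sum_tree(products) == 0.0f);  // both 1.0f terms are absorbed
}
```

Both orders are mathematically equivalent, but the tree has a shorter dependency path (two addition levels instead of four), which is what makes it attractive in hardware when the change in rounding is acceptable.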

The compiler tries to unroll loops on its own when it thinks unrolling improves performance. For example, the loop at line 14 is automatically unrolled because the loop has a constant number of iterations, and does not consume much hardware (ROWS is a constant defined at compile-time, ensuring that this loop has a fixed number of iterations).

You can improve the throughput by unrolling the j-loop at line 11, but to allow the compiler to unroll the loop, you must ensure that it has constant bounds. You can ensure constant bounds by starting the j-loop at j = 0 instead of j = i + 1. You must also add a predication statement to prevent r_matrix from being assigned invalid data during iterations 0, 1, 2, …, i of the j-loop.
01: #define ROWS 4
02: #define COLS 4
04: component void dut(...) {
05: 	float a_matrix[COLS][ROWS]; // store in column-major format
06: 	float r_matrix[ROWS][COLS]; // store in row-major format
08: 	// setup...
10: 	for (int i = 0; i < COLS; i++) {
12: #pragma unroll
13: 		for (int j = 0; j < COLS; j++) {
14: 			float dotProduct = 0;
16: #pragma unroll
17: 			for (int mRow = 0; mRow < ROWS; mRow++) {
18: 				dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
19: 			}
21: 			r_matrix[i][j] = (j > i) ? dotProduct : 0; // predication
22: 		}
23: 	}
26: 	// continue...
28: }

Now the j-loop is fully unrolled. Because its iterations do not depend on one another, all four run at the same time.
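As a sanity check on the rewrite, the following plain C++ sketch (a software model, not HLS component code) compares the triangular-bounds form with the constant-bounds, predicated form. The type aliases and function names are hypothetical, and the result matrix is sized COLS by COLS here for clarity, which matches the original declaration only because ROWS equals COLS in this example.

```cpp
#include <array>
#include <cassert>

constexpr int ROWS = 4;
constexpr int COLS = 4;

using InputMatrix = std::array<std::array<float, ROWS>, COLS>;  // [column][row]
using ResultMatrix = std::array<std::array<float, COLS>, COLS>; // [i][j]

// Dot product of columns i and j of a column-major matrix.
float column_dot(const InputMatrix& a, int i, int j) {
    assert(i >= 0 && i < COLS && j >= 0 && j < COLS);
    float dotProduct = 0.0f;
    for (int mRow = 0; mRow < ROWS; ++mRow) {
        dotProduct += a[i][mRow] * a[j][mRow];
    }
    return dotProduct;
}

// Original form: the j-loop bounds depend on i.
ResultMatrix triangular_bounds(const InputMatrix& a) {
    ResultMatrix r{}; // value-initialized, so every element starts at zero
    for (int i = 0; i < COLS; ++i) {
        for (int j = i + 1; j < COLS; ++j) {
            r[i][j] = column_dot(a, i, j);
        }
    }
    return r;
}

// Rewritten form: constant j-loop bounds plus a predicated store.
ResultMatrix constant_bounds(const InputMatrix& a) {
    ResultMatrix r{};
    for (int i = 0; i < COLS; ++i) {
        for (int j = 0; j < COLS; ++j) {
            const float dotProduct = column_dot(a, i, j);
            r[i][j] = (j > i) ? dotProduct : 0.0f; // predication
        }
    }
    return r;
}
```

The predicated form computes dot products that it then discards (for j <= i). In software that is wasted work, but in hardware it is the price of giving every unrolled iteration identical, fixed structure.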

Refer to the resource_sharing_filter tutorial located at <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter for more details.

You could go further and also unroll the loop at line 10, but unrolling this loop would increase the area again. By letting the compiler pipeline this loop instead of unrolling it, you avoid the area increase and pay only about four more clock cycles, assuming that the i-loop has an II of 1. If the II is not 1, the Details pane of the Loops Analysis page in the high-level design report (report.html) gives you tips on how to improve it.
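The cost of pipelining instead of unrolling can be approximated with a common first-order model: the first iteration takes the full loop-body latency, and each later iteration starts II cycles after the previous one. The helper below is a hypothetical illustration of that model, not a value reported by the compiler, and the latency figure in the usage note is an assumed number.

```cpp
#include <cassert>

// First-order cycle estimate for a pipelined loop (ignores stalls and
// any loop entry or exit overhead).
long pipelined_loop_cycles(long latency, long ii, long trip_count) {
    assert(latency >= 1 && ii >= 1 && trip_count >= 1);
    return latency + ii * (trip_count - 1);
}
```

With an assumed body latency of 20 cycles, pipelined_loop_cycles(20, 1, 4) gives 23 cycles for the four iterations of the i-loop, compared with roughly 20 cycles if the loop were fully unrolled. The overhead is a handful of cycles, on the order of the trip count, and it grows proportionally if the II rises above 1.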

The following factors typically affect loop II:
  • loop-carried dependencies

    See the tutorial at <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency

  • long critical loop path
  • inner loops with a loop II > 1
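To make the first factor concrete, the following plain C++ sketch contrasts a loop without a loop-carried dependency with one that has such a dependency. The function names are hypothetical, and whether a dependency of this kind actually raises the II in a given design depends on the operation latency and the compiler's scheduling, so treat this as an illustration of the source-level pattern only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// No loop-carried dependency: iteration k reads only x[k] and y[k].
std::vector<float> elementwise_product(const std::vector<float>& x,
                                       const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) {
        out[k] = x[k] * y[k];
    }
    return out;
}

// Loop-carried dependency: iteration k needs `sum` from iteration k - 1,
// so a new iteration cannot start its addition until the previous one finishes.
std::vector<float> running_sum(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (std::size_t k = 0; k < x.size(); ++k) {
        sum += x[k];
        out[k] = sum;
    }
    return out;
}
```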