Intel High Level Synthesis Compiler Standard Edition: Best Practices Guide
Version Information
Updated for: Intel® Quartus® Prime Design Suite 19.1
1. Intel HLS Compiler Standard Edition Best Practices Guide
In this publication, <quartus_installdir> refers to the location where you installed Intel® Quartus® Prime Design Suite.
- Windows
- C:\intelFPGA_standard\19.1
- Linux
- /home/<username>/intelFPGA_standard/19.1
About the Intel® HLS Compiler Standard Edition Documentation Library
Title | Description |
---|---|
Release Notes | Provides late-breaking information about the Intel® HLS Compiler. |
Getting Started Guide | Get up and running with the Intel® HLS Compiler by learning how to initialize your compiler environment and reviewing the various design examples and tutorials provided with the Intel® HLS Compiler. |
User Guide | Provides instructions on synthesizing, verifying, and simulating intellectual property (IP) that you design for Intel FPGA products. Go through the entire development flow of your component from creating your component and testbench up to integrating your component IP into a larger system with the Intel Quartus Prime software. |
Reference Manual | Provides reference information about the features supported by the Intel HLS Compiler. Find details on Intel® HLS Compiler command options, header files, pragmas, attributes, macros, declarations, arguments, and template libraries. |
Best Practices Guide | Provides techniques and practices that you can apply to improve the FPGA area utilization and performance of your HLS component. Typically, you apply these best practices after you verify the functional correctness of your component. |
Quick Reference | Provides a brief summary of Intel HLS Compiler declarations and attributes on a single two-sided page. |
2. Best Practices for Coding and Compiling Your Component
- Interface Best Practices: With the Intel® High Level Synthesis Compiler, your component can have a variety of interfaces: from basic wires to the Avalon Streaming and Avalon Memory-Mapped Master interfaces. Review the interface best practices to help you choose and configure the right interface for your component.
- Loop Best Practices: The Intel® High Level Synthesis Compiler pipelines your loops to enhance throughput. Review these loop best practices to learn techniques to optimize your loops to boost the performance of your component.
- Memory Architecture Best Practices: The Intel® High Level Synthesis Compiler infers efficient memory architectures (such as memory width, number of banks, and ports) in a component by adapting the architecture to the memory access patterns of your component. Review the memory architecture best practices to learn how you can get the best memory architecture for your component from the compiler.
- Datatype Best Practices: The datatypes in your component, and any conversions or casting that they might undergo, can significantly affect the performance and FPGA area usage of your component. Review the datatype best practices for tips and guidance on how best to control datatype sizes and conversions in your component.
- Alternative Algorithms: The Intel® High Level Synthesis Compiler lets you compile a component quickly to get initial insights into the performance and area utilization of your component. Take advantage of this speed to try larger algorithm changes to see how those changes affect your component performance.
3. Interface Best Practices
With the Intel® High Level Synthesis Compiler, your component can have a variety of interfaces: from basic wires to the Avalon Streaming and Avalon Memory-Mapped Master interfaces. Review the interface best practices to help you choose and configure the right interface for your component.
Each interface type supported by the Intel® HLS Compiler Standard Edition has different benefits. However, the system that surrounds your component might limit your choices. Keep your requirements in mind when determining the optimal interface for your component.
Demonstrating Interface Best Practices
The Intel® HLS Compiler comes with tutorials that give you working examples to review and run. They demonstrate good coding practices and illustrate important concepts.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
interfaces/overview | Demonstrates the effects on quality-of-results (QoR) of choosing different component interfaces even when the component algorithm remains the same. |
best_practices/parameter_aliasing | Demonstrates the use of the restrict keyword on component arguments. |
interfaces/explicit_streams_buffer | Demonstrates how to use explicit stream_in and stream_out interfaces in the component and testbench. |
interfaces/explicit_streams_packets_ready_valid | Demonstrates how to use the usesPackets, usesValid, and usesReady stream template parameters. |
interfaces/mm_master_testbench_operators | Demonstrates how to invoke a component at different indices of an Avalon Memory-Mapped (MM) Master (mm_master class) interface. |
interfaces/mm_slaves | Demonstrates how to create Avalon-MM Slave interfaces (slave registers and slave memories). |
interfaces/multiple_stream_call_sites | Demonstrates the tradeoffs of using multiple stream call sites. |
interfaces/pointer_mm_master | Demonstrates how to create Avalon-MM Master interfaces and control their parameters. |
interfaces/stable_arguments | Demonstrates how to use the stable attribute for unchanging arguments to improve resource utilization. |
3.1. Choose the Right Interface for Your Component
Different component interfaces can affect the quality of results (QoR) of your component without changing your component algorithm. Consider the effects of different interfaces before choosing the interface for your component.
The best interface for your component might not be immediately apparent, so you might need to try different interfaces for your component to achieve the optimal QoR. Take advantage of the rapid component compilation time provided by the Intel® HLS Compiler and the resulting High Level Design reports to determine which interface gives you the optimal QoR for your component.
This section uses a vector addition example to illustrate the impact of changing the component interface while keeping the component algorithm the same. The example has two input vectors, vector a and vector b, and stores the result to vector c. The vectors have a length of N (which could be very large).
```cpp
#pragma unroll 8
for (int i = 0; i < N; ++i) {
  c[i] = a[i] + b[i];
}
```
The Intel® HLS Compiler extracts the parallelism of this algorithm by pipelining the loops if no loop dependency exists. In addition, by unrolling the loop (by a factor of 8), more parallelism can be extracted.
Ideally, the generated component has a latency of N/8 cycles. In the examples in the following section, a value of 1024 is used for N, so the ideal latency is 128 cycles (1024/8).
The following sections present variations of this example that use different interfaces. Review these sections to learn how different interfaces affect the QoR of this component.
You can work your way through the variations of these examples by reviewing the tutorial available in <quartus_installdir>/hls/examples/tutorials/interfaces/overview.
3.1.1. Pointer Interfaces
Pointers in a component are implemented as Avalon® Memory Mapped (Avalon® MM) master interfaces with default settings. For more details about pointer parameter interfaces, see Intel HLS Compiler Default Interfaces in the Intel® High Level Synthesis Compiler Standard Edition Reference Manual.
```cpp
component void vector_add(int* a, int* b, int* c, int N) {
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}
```

The following Loop Analysis report shows that the component has an undesirably high loop initiation interval (II). The II is high because vectors a, b, and c are all accessed through the same Avalon MM Master interface. The Intel® HLS Compiler uses stallable arbitration logic to schedule these accesses, which results in poor performance and high FPGA area use.
In addition, the compiler cannot assume there are no data dependencies between loop iterations because pointer aliasing might exist. The compiler cannot determine that vectors a, b, and c do not overlap. If data dependencies exist, the Intel® HLS Compiler cannot pipeline the loop iterations effectively.

QoR Metric¹ | Value |
---|---|
ALMs | 15593.5 |
DSPs | 0 |
RAMs | 30 |
fMAX (MHz)² | 298.6 |
Latency (cycles) | 24071 |
Initiation Interval (II) (cycles) | ~508 |

¹The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1.
²The fMAX measurement was calculated from a single seed.
3.1.2. Avalon Memory Mapped Master Interfaces
By default, pointers in a component are implemented as Avalon® Memory Mapped (MM) master interfaces with default settings. You can mitigate poor performance from the default settings by configuring the Avalon® MM master interfaces.
You can configure the Avalon® MM master interface for the vector addition component example using the ihc::mm_master class as follows:
```cpp
component void vector_add(
    ihc::mm_master<int, ihc::aspace<1>,
                   ihc::dwidth<8*8*sizeof(int)>,
                   ihc::align<8*sizeof(int)> >& a,
    ihc::mm_master<int, ihc::aspace<2>,
                   ihc::dwidth<8*8*sizeof(int)>,
                   ihc::align<8*sizeof(int)> >& b,
    ihc::mm_master<int, ihc::aspace<3>,
                   ihc::dwidth<8*8*sizeof(int)>,
                   ihc::align<8*sizeof(int)> >& c,
    int N) {
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}
```
- The vectors are each assigned to different address spaces with the ihc::aspace attribute, and each vector receives a separate Avalon® MM master interface. With the vectors assigned to different physical interfaces, the vectors can be accessed concurrently without interfering with each other, so memory arbitration is not needed.
- The width of the interfaces for the vectors is adjusted with the ihc::dwidth attribute.
- The alignment of the interfaces for the vectors is adjusted with the ihc::align attribute.

The diagram shows that vector_add.B2 has two loads and one store. The default Avalon® MM master settings used by the code example in Pointer Interfaces had 16 loads and 8 stores.
By expanding the width and alignment of the vector interfaces, the original pointer interface loads and stores were coalesced into one wide load each for vector a and vector b, and one wide store for vector c.
Also, the memories are stall-free because the loads and stores in this example access separate memories.
QoR Metric¹ | Pointer | Avalon® MM Master |
---|---|---|
ALMs | 15593.5 | 643 |
DSPs | 0 | 0 |
RAMs | 30 | 0 |
fMAX (MHz)² | 298.6 | 472.37 |
Latency (cycles) | 24071 | 142 |
Initiation Interval (II) (cycles) | ~508 | 1 |

¹The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1.
²The fMAX measurement was calculated from a single seed.
3.1.3. Avalon Memory Mapped Slave Interfaces
When you allocate a slave memory, you must define its size. Defining the size puts a limit on how large a value of N the component can process. In this example, the RAM size is 1024 words, so N can be at most 1024.
```cpp
component void vector_add(
    hls_avalon_slave_memory_argument(1024*sizeof(int)) int* a,
    hls_avalon_slave_memory_argument(1024*sizeof(int)) int* b,
    hls_avalon_slave_memory_argument(1024*sizeof(int)) int* c,
    int N) {
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}
```

QoR Metric¹ | Pointer | Avalon® MM Master | Avalon® MM Slave |
---|---|---|---|
ALMs | 15593.5 | 643 | 490.5 |
DSPs | 0 | 0 | 0 |
RAMs | 30 | 0 | 48 |
fMAX (MHz)² | 298.6 | 472.37 | 498.26 |
Latency (cycles) | 24071 | 142 | 139 |
Initiation Interval (II) (cycles) | ~508 | 1 | 1 |

¹The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1.
²The fMAX measurement was calculated from a single seed.
3.1.4. Avalon Streaming Interfaces
Avalon® Streaming (ST) interfaces support a unidirectional flow of data, and are typically used for components that drive high-bandwidth and low-latency data.
```cpp
struct int_v8 {
  int data[8];
};

component void vector_add(
    ihc::stream_in<int_v8>& a,
    ihc::stream_in<int_v8>& b,
    ihc::stream_out<int_v8>& c,
    int N) {
  for (int j = 0; j < (N/8); ++j) {
    int_v8 av = a.read();
    int_v8 bv = b.read();
    int_v8 cv;
    #pragma unroll 8
    for (int i = 0; i < 8; ++i) {
      cv.data[i] = av.data[i] + bv.data[i];
    }
    c.write(cv);
  }
}
```
An Avalon® ST interface has a data bus, and ready and busy signals for handshaking. The struct is created to pack eight integers so that eight operations at a time can occur in parallel to provide a comparison with the examples for other interfaces. Similarly, the loop count is divided by eight.

The streaming interfaces are stallable from the upstream sources and the downstream output. Because the interfaces are stallable, the loop initiation interval (II) is approximately 1 (instead of exactly 1). If the component does not receive any bubbles (gaps in data flow) from upstream or stall signals from downstream, then the component achieves the desired II of 1.
If you know that the stream interfaces will never stall, you can further optimize this component by taking advantage of the usesReady and usesValid stream parameters.
QoR Metric¹ | Pointer | Avalon® MM Master | Avalon® MM Slave | Avalon® ST |
---|---|---|---|---|
ALMs | 15593.5 | 643 | 490.5 | 314.5 |
DSPs | 0 | 0 | 0 | 0 |
RAMs | 30 | 0 | 48 | 0 |
fMAX (MHz)² | 298.6 | 472.37 | 498.26 | 389.71 |
Latency (cycles) | 24071 | 142 | 139 | 134 |
Initiation Interval (II) (cycles) | ~508 | 1 | 1 | 1 |

¹The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1.
²The fMAX measurement was calculated from a single seed.
3.1.5. Pass-by-Value Interface
For software developers accustomed to writing code that targets a CPU, passing each element in an array by value might be unintuitive because it typically results in many function calls or large parameters. However, for code targeting an FPGA, passing array elements by value can result in smaller and simpler hardware on the FPGA.
```cpp
struct int_v8 {
  int data[8];
};

component int_v8 vector_add(int_v8 a, int_v8 b) {
  int_v8 c;
  #pragma unroll 8
  for (int i = 0; i < 8; ++i) {
    c.data[i] = a.data[i] + b.data[i];
  }
  return c;
}
```
This component takes and processes only eight elements of vector a and vector b, and returns eight elements of vector c. To compute 1024 elements for the example, the component must be called 128 times (1024/8). While in previous examples the component contained loops that were pipelined, here the component itself is invoked many times, and the successive invocations are pipelined.
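To make the 128-call pattern concrete, the following plain C++ sketch shows how a testbench-style caller could feed 1024 elements through the 8-wide pass-by-value function. This is an illustration only: the HLS-specific component keyword is omitted, and the driver function add_1024 is a hypothetical name, not part of the Intel tutorials.

```cpp
// Plain C++ sketch: the 8-wide pass-by-value function from the text,
// invoked 128 times (1024 / 8) by a hypothetical driver.
struct int_v8 { int data[8]; };

int_v8 vector_add(int_v8 a, int_v8 b) {
    int_v8 c;
    for (int i = 0; i < 8; ++i) {
        c.data[i] = a.data[i] + b.data[i];
    }
    return c;
}

// Processes 1024 elements with 128 calls to the 8-wide function.
void add_1024(const int *a, const int *b, int *c) {
    for (int call = 0; call < 128; ++call) {
        int_v8 av, bv;
        for (int i = 0; i < 8; ++i) {
            av.data[i] = a[call * 8 + i];   // pack 8 inputs per call
            bv.data[i] = b[call * 8 + i];
        }
        int_v8 cv = vector_add(av, bv);
        for (int i = 0; i < 8; ++i) {
            c[call * 8 + i] = cv.data[i];   // unpack 8 results per call
        }
    }
}
```

In hardware, each of the 128 invocations enters the component's pipeline back to back, which is why the latency in the table below approaches the ideal 128 cycles.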

QoR Metric¹ | Pointer | Avalon® MM Master | Avalon® MM Slave | Avalon® ST | Pass-by-Value |
---|---|---|---|---|---|
ALMs | 15593.5 | 643 | 490.5 | 314.5 | 130 |
DSPs | 0 | 0 | 0 | 0 | 0 |
RAMs | 30 | 0 | 48 | 0 | 0 |
fMAX (MHz)² | 298.6 | 472.37 | 498.26 | 389.71 | 581.06 |
Latency (cycles) | 24071 | 142 | 139 | 134 | 128 |
Initiation Interval (II) (cycles) | ~508 | 1 | 1 | 1 | 1 |

¹The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1.
²The fMAX measurement was calculated from a single seed.
3.2. Avoid Pointer Aliasing
Add a restrict type qualifier to pointer types whenever possible. By having restrict-qualified pointers, you prevent the Intel® HLS Compiler Standard Edition from creating unnecessary memory dependencies between nonconflicting read and write operations.
In Intel® HLS Compiler source code, the qualifier is spelled restrict.
Consider a loop where each iteration reads data from one array, and then it writes data to another array in the same physical memory. Without adding the restrict type qualifier to these pointer arguments, the compiler must assume that the two arrays overlap. Therefore, the compiler must keep the original order of memory accesses to both arrays, resulting in poor loop optimization or even failure to pipeline the loop that contains the memory accesses.
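As a sketch of the idea in portable C++: standard C++ does not define a restrict keyword, so this example uses the common __restrict compiler extension accepted by GCC, Clang, and MSVC; in Intel HLS component code you would apply the qualifier to the pointer arguments in the same way. The function add_one is a hypothetical example, not from the Intel tutorials.

```cpp
#include <cstddef>

// Because dst and src are declared non-aliasing, the compiler is free
// to reorder and overlap the reads and writes across iterations when
// pipelining; without the qualifier it must assume the arrays overlap.
void add_one(int *__restrict dst, const int *__restrict src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1;
    }
}
```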
For a working example of the restrict type qualifier, review the tutorial in <quartus_installdir>/hls/examples/tutorials/best_practices/parameter_aliasing.
4. Loop Best Practices
The reports generated by the Intel® HLS Compiler Standard Edition let you know if there are any dependencies that prevent it from optimizing your loops. Try to eliminate these dependencies in your code for optimal component performance. You can also provide additional guidance to the compiler by using the available loop pragmas.
- Manually fuse adjacent loop bodies when the instructions in those loop bodies can be performed in parallel. These fused loops can be pipelined instead of being executed sequentially. Pipelining reduces the latency of your component and can reduce the FPGA area your component uses.
- Use the #pragma loop_coalesce directive to have the compiler attempt to collapse nested loops. Coalescing loops reduces the latency of your component and can reduce the FPGA area overhead needed for nested loops.
Tutorials Demonstrating Loop Best Practices
The Intel® HLS Compiler Standard Edition comes with a number of tutorials that give you working examples to review and run. They demonstrate good coding practices and illustrate important concepts.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
best_practices/loop_memory_dependency | Demonstrates breaking loop-carried dependencies using the ivdep pragma. |
best_practices/resource_sharing_filter | Demonstrates different versions of a 32-tap finite impulse response (FIR) filter design. |
4.1. Reuse Hardware By Calling It In a Loop
Loops are a useful way to reuse hardware. If your component function calls another function, the called function is inlined into the top-level component. Calling a function multiple times therefore results in duplicated hardware for each call.
For example, the following code instantiates separate hardware for each of the three calls to foo:

```cpp
int foo(int a) {
  return 4 + sqrt(a);
}

component void myComponent() {
  ...
  int x = 0;
  x += foo(0);
  x += foo(1);
  x += foo(2);
  ...
}
```
Calling foo in a loop instead allows the hardware for foo to be reused across iterations:

```cpp
component void myComponent() {
  ...
  int x = 0;
  #pragma unroll 1
  for (int i = 0; i < 3; i++) {
    x += foo(i);
  }
  ...
}
```
If the argument values are not a simple function of the loop index, a switch statement inside the loop still lets the hardware for foo be shared:

```cpp
component void myComponent() {
  ...
  int x = 0;
  #pragma unroll 1
  for (int i = 0; i < 3; i++) {
    int val = 0;
    switch (i) {
      case 0: val = 3; break;
      case 1: val = 6; break;
      case 2: val = 1; break;
    }
    x += foo(val);
  }
  ...
}
```
You can learn more about reusing hardware and minimizing inlining by reviewing the resource sharing tutorial available in <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter.
4.2. Parallelize Loops
You can take advantage of the spatial compute structure to accelerate the loops by having multiple iterations of a loop executing concurrently. To have multiple iterations of a loop execute concurrently, unroll loops when possible and structure your loops so that dependencies between loop iterations are minimized and can be resolved within one clock cycle.
4.2.1. Pipeline Loops


This loop is pipelined with a loop initiation interval (II) of 1. An II of 1 means that there is a delay of 1 clock cycle between starting each successive loop iteration.
The Intel® HLS Compiler attempts to pipeline loops by default, and loop pipelining is not subject to the same constant iteration count constraint that loop unrolling is.
Not all loops can be pipelined as well as the loop shown in Figure 7, particularly loops where each iteration depends on a value computed in a previous iteration.
For example, consider if Stage 1 of the loop depended on a value computed during Stage 3 of the previous loop iteration. In that case, the second (orange) iteration could not start executing until the first (blue) iteration had reached Stage 3. This type of dependency is called a loop-carried dependency.
In this example, the loop would be pipelined with II=3. Because the II is the same as the latency of a loop iteration, the loop would not actually be pipelined at all. You can estimate the overall latency of a loop with the following equation:

latency_loop = (iterations − 1) × II + latency_iteration

where latency_loop is the number of cycles the loop takes to execute and latency_iteration is the number of cycles a single loop iteration takes to execute.
The Intel® HLS Compiler Standard Edition supports pipelining nested loops without unrolling inner loops. When calculating the latency of nested loops, apply this formula recursively. This recursion means that having II>1 is more problematic for inner loops than for outer loops. Therefore, algorithms that do most of their work on an inner loop with II=1 still perform well, even if their outer loops have II>1.
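The latency estimate can be sketched as a small helper function (plain C++, illustrative only; loop_latency is a hypothetical name):

```cpp
// Estimated cycles for a pipelined loop:
// (iterations - 1) * II + latency of one iteration.
long loop_latency(long iterations, long ii, long iteration_latency) {
    return (iterations - 1) * ii + iteration_latency;
}
```

For example, a 1024-iteration loop with II=1 and a 3-cycle iteration latency takes about 1026 cycles, while the same loop with II=3 takes about 3072 cycles because no iterations overlap.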
4.2.2. Unroll Loops


You can control how the compiler unrolls a loop with the #pragma unroll directive, but this directive works only if the compiler knows the trip count for the loop in advance or if you specify the unroll factor. In addition to replicating the hardware, the compiler also reschedules the circuit such that each operation runs as soon as the inputs for the operation are ready.
For an example of using the #pragma unroll directive, see the best_practices/resource_sharing_filter tutorial.
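The two forms of the directive can be sketched as follows. This is a hedged illustration: the component keyword and other HLS-specific constructs are omitted so the code compiles as plain C++, where a standard compiler simply ignores or honors the pragma as an optimization hint; dot4 and sum16 are hypothetical examples.

```cpp
int dot4(const int *a, const int *b) {
    int sum = 0;
#pragma unroll           // full unroll: the trip count (4) is known at compile time
    for (int i = 0; i < 4; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

int sum16(const int *a) {
    int sum = 0;
#pragma unroll 4         // partial unroll by an explicit factor of 4
    for (int i = 0; i < 16; ++i) {
        sum += a[i];
    }
    return sum;
}
```

With the factor form, the compiler replicates the loop body four times and divides the trip count by four, trading area for throughput without needing the full trip count to be constant.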
4.2.3. Example: Loop Pipelining and Unrolling
```cpp
 1. #define ROWS 4
 2. #define COLS 4
 3.
 4. component void dut(...) {
 5.   float a_matrix[COLS][ROWS]; // store in column-major format
 6.   float r_matrix[ROWS][COLS]; // store in row-major format
 7.
 8.   // setup...
 9.
10.   for (int i = 0; i < COLS; i++) {
11.     for (int j = i + 1; j < COLS; j++) {
12.
13.       float dotProduct = 0;
14.       for (int mRow = 0; mRow < ROWS; mRow++) {
15.         dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
16.       }
17.       r_matrix[i][j] = dotProduct;
18.     }
19.   }
20.
21.   // continue...
22.
23. }
```
You can improve the performance of this component by unrolling the loops that iterate across each entry of a particular column. If the loop operations are independent, then the compiler executes them in parallel.
Floating-point operations typically must be carried out in the same order that they are expressed in your source code to preserve numerical precision. However, you can use the --fp-relaxed compiler flag to relax the ordering of floating-point operations. With the order of floating-point operations relaxed, all of the multiplications in this loop can occur in parallel. To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops.
The compiler tries to unroll loops on its own when it thinks unrolling improves performance. For example, the loop at line 14 is automatically unrolled because the loop has a constant number of iterations, and does not consume much hardware (ROWS is a constant defined at compile-time, ensuring that this loop has a fixed number of iterations).
```cpp
01: #define ROWS 4
02: #define COLS 4
03:
04: component void dut(...) {
05:   float a_matrix[COLS][ROWS]; // store in column-major format
06:   float r_matrix[ROWS][COLS]; // store in row-major format
07:
08:   // setup...
09:
10:   for (int i = 0; i < COLS; i++) {
11:
12:     #pragma unroll
13:     for (int j = 0; j < COLS; j++) {
14:       float dotProduct = 0;
15:
16:       #pragma unroll
17:       for (int mRow = 0; mRow < ROWS; mRow++) {
18:         dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
19:       }
20:
21:       r_matrix[i][j] = (j > i) ? dotProduct : 0; // predication
22:     }
23:   }
24:
25:   // continue...
26:
27: }
```
Now the j-loop is fully unrolled. Because they do not have any dependencies, all four iterations run at the same time.
Refer to the resource_sharing_filter tutorial located at <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter for more details.
You could continue and also unroll the loop at line 10, but unrolling this loop would result in the area increasing again. By allowing the compiler to pipeline this loop instead of unrolling it, you can avoid increasing the area and pay only about four more clock cycles, assuming that the i-loop has an II of 1. If the II is not 1, the Details pane of the Loop Analysis page in the high-level design report (report.html) gives you tips on how to improve it. A loop II greater than 1 is typically caused by:

- loop-carried dependencies (see the tutorial at <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency)
- a long critical loop path
- inner loops with a loop II > 1
4.3. Construct Well-Formed Loops
A well-formed loop has an exit condition that compares against an integer bound and has a simple induction increment of one per iteration. The Intel® HLS Compiler Standard Edition can analyze well-formed loops efficiently, which can help improve the performance of your component.
```cpp
for (int i = 0; i < N; i++) {
  // statements
}
```
Well-formed nested loops can also help maximize the performance of your component.
```cpp
for (int i = 0; i < N; i++) {
  // statements
  for (int j = 0; j < M; j++) {
    // statements
  }
}
```
4.4. Minimize Loop-Carried Dependencies
The loop structure below has a loop-carried dependency because each loop iteration reads data written by the previous iteration. As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies reduces the pipeline parallelism that the Intel® HLS Compiler Standard Edition can achieve, which reduces component performance.
```cpp
for (int i = 1; i < N; i++) {
  A[i] = A[i - 1] + i;
}
```
The Intel® HLS Compiler performs a static memory dependency analysis on loops to determine the extent of parallelism that it can achieve. If the Intel® HLS Compiler cannot determine that there are no loop-carried dependencies, it assumes that loop-carried dependencies exist. The ability of the compiler to test for loop-carried dependencies is impeded by unknown variables at compilation time or by array accesses in your code that involve complex addressing.
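One common way to relax a dependency like the A[i] = A[i - 1] + i loop above is to carry the running value in a scalar, so the loop-carried value lives in a register rather than being re-read from memory each iteration. This is a plain C++ sketch under that assumption; prefix_fill is a hypothetical name:

```cpp
// Equivalent to: for (i = 1; i < n; i++) A[i] = A[i - 1] + i;
// but the carried value is the scalar acc, a register-to-register
// dependency that can resolve in one cycle; the store to A[i] no
// longer feeds the next iteration.
void prefix_fill(int *A, int n) {
    int acc = A[0];
    for (int i = 1; i < n; ++i) {
        acc += i;
        A[i] = acc;
    }
}
```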
To avoid unnecessary loop-carried dependencies and help the compiler to better analyze your loops, follow these guidelines:
Avoid Pointer Arithmetic
Compiler output is suboptimal when your component accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array as follows:
```cpp
for (int i = 0; i < N; i++) {
  int t = *(A++);
  *A = t;
}
```
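The same operation can be expressed with a simple array index that the compiler can analyze. Note that this rewrite keeps the loop-carried data flow of the original pointer loop (each write feeds the next read); it only makes the access pattern explicit. This is a plain C++ sketch, and propagate is a hypothetical name:

```cpp
// Indexed rewrite of the pointer-arithmetic loop above: iteration i
// reads A[i] and writes A[i + 1], so A must hold at least n + 1
// elements. The pattern A[i + 1] = A[i] is now visible to dependency
// analysis instead of being hidden behind pointer increments.
void propagate(int *A, int n) {
    for (int i = 0; i < n; ++i) {
        A[i + 1] = A[i];
    }
}
```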
Introduce Simple Array Indexes
Some types of complex array indexes cannot be analyzed effectively, which might lead to suboptimal compiler output. Avoid the following constructs as much as possible:

- Nonconstants in array indexes. For example, A[K + i], where i is the loop index variable and K is an unknown variable.
- Multiple index variables in the same subscript location. For example, A[i + 2*j], where i and j are loop index variables for a doubly nested loop. The array index A[i][j] can be analyzed effectively because the index variables are in different subscripts.
- Nonlinear indexing. For example, A[i & C], where i is a loop index variable and C is a nonconstant variable.
Use Loops with Constant Bounds Whenever Possible
The compiler can perform range analysis effectively when loops have constant bounds.
Ignore Loop-Carried Dependencies
If there are no implicit memory dependencies across loop iterations, you can use the ivdep pragma to tell the Intel® HLS Compiler Standard Edition to ignore the memory dependency.
For details about how to use the ivdep pragma, see Loop-Carried Dependencies (ivdep Pragma) in the Intel® High Level Synthesis Compiler Standard Edition Reference Manual.
4.5. Avoid Complex Loop-Exit Conditions
If a loop in your component has complex exit conditions, memory accesses or complex operations might be required to evaluate the condition. Subsequent iterations of the loop cannot launch in the loop pipeline until the evaluation completes, which can decrease the overall performance of the loop.
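For example, a search loop that exits as soon as it finds a match has a data-dependent exit condition. One way to avoid that, sketched here in plain C++, is to run the loop to a simple constant bound and predicate the work instead of breaking out early (find_first_zero is a hypothetical example, not from the Intel tutorials):

```cpp
// The exit condition is a simple integer comparison (i < n), so the
// pipeline can keep launching iterations; the data-dependent check is
// predicated inside the body rather than gating the next iteration.
int find_first_zero(const int *a, int n) {
    int found = -1;
    for (int i = 0; i < n; ++i) {
        if (found < 0 && a[i] == 0) {
            found = i;   // record the match instead of breaking out
        }
    }
    return found;
}
```

The tradeoff is that the loop always runs all n iterations, which is usually acceptable when it lets every iteration launch back to back.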
4.6. Convert Nested Loops into a Single Loop
To maximize performance, combine nested loops into a single loop whenever possible. The control flow for a loop adds overhead both in the logic required and in the FPGA hardware footprint. Combining nested loops into a single loop reduces this overhead and improves the performance of your component.
The following code examples illustrate the conversion of a nested loop into a single loop:
Nested Loop:

```cpp
for (i = 0; i < N; i++) {
  // statements
  for (j = 0; j < M; j++) {
    // statements
  }
  // statements
}
```

Converted Single Loop:

```cpp
for (i = 0; i < N*M; i++) {
  // statements
}
```
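If the single-loop body still needs the original i and j values, they can be recovered from the combined counter with division and modulus, as in this plain C++ sketch (fill_flat is a hypothetical example; M is the inner trip count):

```cpp
// One flat loop of N * M iterations replaces the nested pair; the
// original indexes are recovered as i = k / M and j = k % M.
void fill_flat(int *out, int N, int M) {
    for (int k = 0; k < N * M; ++k) {
        int i = k / M;
        int j = k % M;
        out[k] = i * 100 + j;   // stand-in for the nested-loop body
    }
}
```

When M is a power of two, the division and modulus reduce to a shift and a mask, so the index recovery costs very little hardware.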
4.7. Declare Variables in the Deepest Scope Possible
To reduce the FPGA hardware resources necessary for implementing a variable, declare the variable just before you use it in a loop. Declaring variables in the deepest scope possible minimizes data dependencies and FPGA hardware usage because the Intel® HLS Compiler Standard Edition does not need to preserve the variable data across loops that do not use the variables.
Consider the following example:
```cpp
int a[N];
for (int i = 0; i < m; ++i) {
  int b[N];
  for (int j = 0; j < n; ++j) {
    // statements
  }
}
```
The array a requires more resources to implement than the array b. To reduce hardware usage, declare array a inside the outer loop unless the data in array a must be maintained across iterations of the outer loop.
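The guideline can be sketched as runnable plain C++ (deepest_scope_sum is a hypothetical example): the scratch array is declared inside the outer loop, on the assumption that its contents do not need to survive across outer-loop iterations.

```cpp
// The scratch array b lives only for one outer iteration, so the
// compiler does not need to preserve its contents across iterations.
int deepest_scope_sum(int m, int n) {
    int total = 0;
    for (int i = 0; i < m; ++i) {
        int b[8] = {0};                      // deepest scope that works
        for (int j = 0; j < n && j < 8; ++j) {
            b[j] = i + j;
        }
        for (int j = 0; j < n && j < 8; ++j) {
            total += b[j];
        }
    }
    return total;
}
```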
5. Memory Architecture Best Practices
In most cases, you can optimize the memory architecture by modifying the access pattern. However, the Intel® HLS Compiler Standard Edition gives you some control over the memory architecture.
Tutorials Demonstrating Memory Architecture Best Practices
The Intel® HLS Compiler comes with a number of tutorials that give you working examples to review and run. They demonstrate good coding practices and illustrate important concepts.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
component_memories/bank_bits | Demonstrates how to control component internal memory architecture for parallel memory access by enforcing which address bits are used for banking. |
component_memories/depth_wise_merge | Demonstrates how to improve resource utilization by implementing two logical memories as a single physical memory with a depth equal to the sum of the depths of the two original memories. |
component_memories/width_wise_merge | Demonstrates how to improve resource utilization by implementing two logical memories as a single physical memory with a width equal to the sum of the widths of the two original memories. |
5.1. Example: Overriding a Coalesced Memory Architecture
Using memory attributes in various combinations in your code allows you to override the memory architecture that the Intel® HLS Compiler Standard Edition infers for your component.
The following code examples demonstrate how you can use the following memory attributes to override coalesced memory to conserve memory blocks on your FPGA:
- hls_bankwidth(n)
- hls_numbanks(n)
- hls_singlepump
- hls_numports_readonly_writeonly(m,n)
The original code coalesces two memory accesses, resulting in a memory system that is 256 bits wide by 64 words deep (two on-chip memory blocks):
```cpp
component unsigned int mem_coalesce_default(unsigned int raddr,
                                            unsigned int waddr,
                                            unsigned int wdata) {
  unsigned int data[512];
  data[2*waddr] = wdata;
  data[2*waddr + 1] = wdata + 1;
  unsigned int rdata = data[2*raddr] + data[2*raddr + 1];
  return rdata;
}
```
The following images show how the 256x64 bit memory for this code sample is structured, as well as how the component memory structure is shown in the high-level design report (report.html).
The modified code implements a simple dual-port on-chip memory block that is 512 words deep by 32 bits wide with stallable arbitration:
```cpp
component unsigned int mem_coalesce_override(unsigned int raddr,
                                             unsigned int waddr,
                                             unsigned int wdata) {
  // Attributes that stop memory coalescing
  hls_bankwidth(4) hls_numbanks(1)
  // Attributes that specify a simple dual port
  hls_singlepump hls_numports_readonly_writeonly(1,1)
  unsigned int data[512];
  data[2*waddr] = wdata;
  data[2*waddr + 1] = wdata + 1;
  unsigned int rdata = data[2*raddr] + data[2*raddr + 1];
  return rdata;
}
```
The following images show how the 512x32 bit memory with stallable arbitration for this code sample is structured, as well as how the component memory structure is shown in the high-level design report (report.html).
While it might appear that you save hardware area by reducing the number of RAM blocks needed for the component, the introduction of stallable arbitration increases the amount of hardware needed to implement the component. Compare the number of ALMs and FFs required by the two components to evaluate this trade-off.

5.2. Example: Overriding a Banked Memory Architecture
Using memory attributes in various combinations in your code allows you to override the memory architecture that the Intel® HLS Compiler Standard Edition infers for your component.
The following code examples demonstrate how you can use the following memory attributes to override banked memory to conserve memory blocks on your FPGA:
- hls_bankwidth(N)
- hls_numbanks(N)
- hls_singlepump
- hls_doublepump
The original code creates two banks of single-pumped on-chip memory blocks that are 16 bits wide:
component unsigned short mem_banked(unsigned short raddr, unsigned short waddr,
                                    unsigned short wdata) {
    unsigned short data[1024];

    data[2*waddr]     = wdata;
    data[2*waddr + 9] = wdata + 1;

    unsigned short rdata = data[2*raddr] + data[2*raddr + 9];
    return rdata;
}
To reduce the number of memory banks, you can implement one bank of double-pumped, 32-bit wide on-chip memory by adding the following attributes before the declaration of data[1024]. These attributes fold the two half-used memory banks into one fully-used memory bank that is double-pumped, so that it can be accessed as quickly as the two half-used memory banks.
hls_bankwidth(2) hls_numbanks(1)
hls_doublepump
unsigned short data[1024];
Alternatively, you can avoid the double-clock requirement of double-pumped memory by implementing one bank of single-pumped on-chip memory. To do so, add the following attributes before the declaration of data[1024]. However, in this example, these attributes add stallable arbitration to your component memories, which hurts the performance of your component.
hls_bankwidth(2) hls_numbanks(1)
hls_singlepump
unsigned short data[1024];
5.3. Merge Memories to Reduce Area
In some cases, you can save FPGA memory blocks by merging your component memories so that they consume fewer memory blocks, reducing the FPGA area your component uses. Use the hls_merge attribute to force the Intel® HLS Compiler Standard Edition to implement different variables in the same memory system.
When you merge memories, multiple component variables share the same memory block. You can merge memories by width (width-wise merge) or depth (depth-wise merge). You can merge memories where the data in the memories have different datatypes.
The following diagram shows how four memories can be merged width-wise and depth-wise.
5.3.1. Example: Merging Memories Depth-Wise
Use the hls_merge("<mem_name>","depth") attribute to force the Intel® HLS Compiler Standard Edition to implement variables in the same memory system, merging their memories by depth.
All variables with the same <mem_name> label set in their hls_merge attributes are merged.
Consider the following component code:
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
    int a[128];
    int b[128];
    int rdata;

    // Mutually exclusive write
    if (use_a) {
        a[waddr] = wdata;
    } else {
        b[waddr] = wdata;
    }

    // Mutually exclusive read
    if (use_a) {
        rdata = a[raddr];
    } else {
        rdata = b[raddr];
    }
    return rdata;
}
The code instructs the Intel® HLS Compiler to implement local memories a and b as two on-chip memory blocks, each with its own load and store instructions.
Because the load and store instructions for local memories a and b are mutually exclusive, you can merge the accesses, as shown in the example code below. Merging the memory accesses reduces the number of load and store instructions, and the number of on-chip memory blocks, by half.
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
    int a[128] hls_merge("mem","depth");
    int b[128] hls_merge("mem","depth");
    int rdata;

    // Mutually exclusive write
    if (use_a) {
        a[waddr] = wdata;
    } else {
        b[waddr] = wdata;
    }

    // Mutually exclusive read
    if (use_a) {
        rdata = a[raddr];
    } else {
        rdata = b[raddr];
    }
    return rdata;
}
There are cases where merging local memories with respect to depth might degrade memory access efficiency. Before you decide whether to merge the local memories with respect to depth, refer to the high-level design report (<result>.prj/reports/report.html) to ensure that the merge produced the expected memory configuration with the expected number of load and store instructions. In the example below, the Intel® HLS Compiler should not merge the accesses to local memories a and b because the load and store instructions to each memory are not mutually exclusive.
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
    int a[128] hls_merge("mem","depth");
    int b[128] hls_merge("mem","depth");
    int rdata;

    // NOT mutually exclusive write
    a[waddr] = wdata;
    b[waddr] = wdata;

    // NOT mutually exclusive read
    rdata  = a[raddr];
    rdata += b[raddr];
    return rdata;
}
In this case, the Intel® HLS Compiler might double pump the memory system to provide enough ports for all the accesses. Otherwise, the accesses must share ports, which prevents stall-free accesses.
5.3.2. Example: Merging Memories Width-Wise
Use the hls_merge("<mem_name>","width") attribute to force the Intel® HLS Compiler Standard Edition to implement variables in the same memory system, merging their memories by width.
All variables with the same <mem_name> label set in their hls_merge attributes are merged.
Consider the following component code:
component short width_manual(int raddr, int waddr, short wdata) {
    short a[256];
    short b[256];
    short rdata = 0;

    // Lock-step write
    a[waddr] = wdata;
    b[waddr] = wdata;

    // Lock-step read
    rdata += a[raddr];
    rdata += b[raddr];
    return rdata;
}
In this case, the Intel® HLS Compiler can coalesce the load and store instructions to local memories a and b because their accesses are to the same address, as shown below.
component short width_manual(int raddr, int waddr, short wdata) {
    short a[256] hls_merge("mem","width");
    short b[256] hls_merge("mem","width");
    short rdata = 0;

    // Lock-step write
    a[waddr] = wdata;
    b[waddr] = wdata;

    // Lock-step read
    rdata += a[raddr];
    rdata += b[raddr];
    return rdata;
}
5.4. Example: Specifying Bank-Selection Bits for Local Memory Addresses
The (b0, b1, ..., bn) arguments refer to the local memory address bit positions that the Intel® HLS Compiler should use for the bank-selection bits. Specifying the hls_bankbits(b0, b1, ..., bn) attribute implies that the number of banks equals 2^(number of bank bits). For example, with the two bank-select bits taken from the high-order bits of a 5-bit local memory address, the words of the memory are distributed across four banks as follows (the two bank-select bits are shown separated from the three word-select bits):
 | Bank 0 | Bank 1 | Bank 2 | Bank 3 |
---|---|---|---|---|
Word 0 | 00 000 | 01 000 | 10 000 | 11 000 |
Word 1 | 00 001 | 01 001 | 10 001 | 11 001 |
Word 2 | 00 010 | 01 010 | 10 010 | 11 010 |
Word 3 | 00 011 | 01 011 | 10 011 | 11 011 |
Word 4 | 00 100 | 01 100 | 10 100 | 11 100 |
Word 5 | 00 101 | 01 101 | 10 101 | 11 101 |
Word 6 | 00 110 | 01 110 | 10 110 | 11 110 |
Word 7 | 00 111 | 01 111 | 10 111 | 11 111 |
Example of Implementing the hls_bankbits Attribute
Consider the following example component code:
component int bank_arb_consecutive_multidim(int raddr, int waddr, int wdata,
                                            int upperdim) {
    int a[2][4][128] hls_numbanks(1);

    #pragma unroll
    for (int i = 0; i < 4; i++) {
        a[upperdim][i][(waddr & 0x7f)] = wdata + i;
    }

    int rdata = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        rdata += a[upperdim][i][(raddr & 0x7f)];
    }
    return rdata;
}
As illustrated in the following figure, this code example generates multiple load and store instructions, and therefore multiple load/store units (LSUs) in the hardware. If the memory system is not split into multiple banks, there are fewer ports than memory access instructions, leading to arbitrated accesses. This arbitration results in a high loop initiation interval (II) value. Avoid arbitration blocks whenever possible because they consume a lot of FPGA area and can hurt the performance of your component.

By default, the Intel® HLS Compiler splits the memory into banks if it determines that the split is beneficial to the performance of your component. When the compiler generates a memory system, it uses the lower-order memory address bits to access the different memory banks. This behavior means that if you define your component memory structure so that the lowest order addresses are accessed in parallel, the compiler automatically infers the bank-selection bits for you.
This access pattern prevents stallable arbitration on the memory. In this case, preventing stallable arbitration reduced the II value to 1. In practice, this might mean that you store a matrix in column-major format instead of row-major format, if you intend to access multiple matrix rows concurrently.
Swapping the 128-element and 4-element dimension in the code example that follows results in no stallable memory arbitration.
component int bank_arb_consecutive_multidim(int raddr, int waddr, int wdata,
                                            int upperdim) {
    int a[2][128][4];

    #pragma unroll
    for (int i = 0; i < 4; i++) {
        a[upperdim][(waddr & 0x7f)][i] = wdata + i;
    }

    int rdata = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        rdata += a[upperdim][(raddr & 0x7f)][i];
    }
    return rdata;
}
The dimension that is accessed in parallel is moved to be the lowest-order dimension in the memory array. The load has a width of 128 bits, which is the same as four 32-bit loads.
If you cannot change your memory structure, you can use the hls_bankbits attribute to explicitly control how load and store instructions access local memory. As shown in the following modified code example and figure, when you choose constant bank-select bits for each access to the local memory a, each pair of load and store instructions needs to connect to only one memory bank. In this example, there are four 32-bit loads, which results in a memory system similar to the earlier example.
component int bank_arb_consecutive_multidim(int raddr, int waddr, int wdata,
                                            int upperdim) {
    int a[2][4][128] hls_bankbits(8,7);

    #pragma unroll
    for (int i = 0; i < 4; i++) {
        a[upperdim][i][(waddr & 0x7f)] = wdata + i;
    }

    int rdata = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        rdata += a[upperdim][i][(raddr & 0x7f)];
    }
    return rdata;
}
When specifying the word-address bits for the hls_bankbits attribute, ensure that the resulting bank-select bits are constant for each access to local memory. As shown in the following example, the local memory access pattern does not guarantee that the chosen bank-select bits are constant for each access. As a result, each pair of load and store instructions must connect to all the local memory banks, leading to stallable accesses.
component int bank_arb_consecutive_multidim(int raddr, int waddr, int wdata,
                                            int upperdim) {
    int a[2][4][128] hls_bankbits(5,4);

    #pragma unroll
    for (int i = 0; i < 4; i++) {
        a[upperdim][i][(waddr & 0x7f)] = wdata + i;
    }

    int rdata = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        rdata += a[upperdim][i][(raddr & 0x7f)];
    }
    return rdata;
}
In this case, the II is estimated to be approximately 64.
6. Datatype Best Practices
After you optimize the algorithm bottlenecks of your design, you can fine-tune some datatypes in your component by using arbitrary precision datatypes to shrink data widths, which reduces FPGA area utilization. The Intel® HLS Compiler Standard Edition provides debug functionality so that you can easily detect overflows in arbitrary precision datatypes.
Tutorials Demonstrating Datatype Best Practices
The Intel® HLS Compiler Standard Edition comes with a number of tutorials that give you working examples to review and run, so that you can see good coding practices and important concepts in action.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
best_practices/ac_datatypes | Demonstrates the effect of using ac_int datatype instead of int datatype. |
ac_datatypes/ac_fixed_constructor | Demonstrates the use of the ac_fixed constructor where you can get a better QoR by using minor variations in coding style. |
ac_datatypes/ac_int_basic_ops | Demonstrates the operators available for the ac_int class. |
ac_datatypes/ac_int_overflow | Demonstrates the usage of the DEBUG_AC_INT_WARNING and DEBUG_AC_INT_ERROR keywords to help detect overflow during emulation runtime. |
best_practices/single_vs_double_precision_math | Demonstrates the effect of using single precision literals and functions instead of double precision literals and functions. |
best_practices/integer_promotion | Demonstrates how integer promotion rules can influence the behavior of a C or C++ program. |
6.1. Avoid Implicit Data Type Conversions
Using the -Wconversion compiler option helps you avoid inadvertent conversions between double-precision and single-precision values when double-precision variables are not needed. In FPGAs, using double-precision variables can negatively affect the data transfer rate, the latency, and the resource utilization of your component.
If you use the Algorithmic C (AC) arbitrary precision datatypes, pay attention to the type propagation rules.
6.2. Avoid Negative Bit Shifts When Using the ac_int Datatype
Bit shifting with the ac_int datatype differs from bit shifting in other languages, including C and Verilog. By default, if the shift amount is of a signed datatype, ac_int allows negative shifts.
In hardware, this negative shift results in the implementation of both a left shifter and a right shifter. The following code example shows a shift amount that is a signed datatype.
int14 shift_left(int14 a, int14 b) {
    return (a << b);
}
If you know that the shift is always in one direction, to implement an efficient shift operator, declare the shift amount as an unsigned datatype as follows:
int14 efficient_left_only_shift(int14 a, uint14 b) {
    return (a << b);
}
7. Advanced Troubleshooting
The information in this section can help you troubleshoot the following issues:
- Your component behaves differently in cosimulation and emulation.
- Your component has unexpectedly poor performance, resource utilization, or both.
7.1. Component Fails Only In Cosimulation
Comparing Floating Point Results
Use an epsilon when comparing floating-point results in the testbench. Floating-point results from the RTL hardware can differ from the results from the x86 emulation flow.
Using #pragma ivdep to Ignore Memory Dependencies
The #pragma ivdep compiler pragma can cause functional incorrectness in your component if your component has a memory dependency that you attempted to ignore with the pragma. You can use the safelen modifier to control how many memory accesses you can permit before a memory dependency occurs.
See Loop-Carried Dependencies (ivdep Pragma) in Intel® High Level Synthesis Compiler Standard Edition Reference Manual for a description of this pragma.
To see an example of using the ivdep pragma, review the tutorial in <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency.
Unintentional Integer Promotion
The Intel® HLS Compiler Standard Edition does not automatically promote small data types (such as unsigned char or short) to 32-bit widths during cosimulation. Other compilers (like g++) might promote integers when compiling your component for emulation. This difference in integer promotion behavior means that a value in your component might preserve overflow bits during emulation but be truncated during cosimulation.
component int add_width(unsigned char a, unsigned char b) {
    int sum = a + b;
    return sum;
}
This code example generates different overflow behavior in emulation and cosimulation. The Intel® HLS Compiler truncates the integers at 8 bits, while other C++ compilers preserve the overflow bits.
You can mimic the integer promotion behavior by using the --promote-integers compiler option. See Compiler Options in Intel® High Level Synthesis Compiler Standard Edition Reference Manual for a description of this compiler option.
To see an example of using the --promote-integers compiler option, review the tutorial in <quartus_installdir>/hls/examples/tutorials/best_practices/integer_promotion.
Check for Uninitialized Variables
Many coding practices can result in behavior that is undefined by the C++ specification. Sometimes this undefined behavior works as expected in emulation, but not in cosimulation.
A common example of this situation occurs when your design reads from uninitialized variables, especially uninitialized struct variables.
Check your code for uninitialized values with the -Wuninitialized compiler flag, or debug your emulation testbench with the valgrind debugging tool. The -Wuninitialized compiler flag does not show uninitialized struct variables.
You can also check for misbehaving variables by using one or more stream interfaces as debug streams. You can add one or more ihc::stream_out interfaces to your component to have the component write out its internal state variables as it executes. By comparing the output of the emulation flow and the cosimulation flow, you can see where the RTL behavior diverges from the emulator behavior.
Non-blocking Stream Accesses
The emulation model of tryRead() is not cycle-accurate, so the behavior of tryRead() might differ between emulation and cosimulation.
If you have a non-blocking stream access (for example, tryRead()) from a stream with a FIFO (that is, the ihc::depth<> template parameter), then the first few iterations of tryRead() might return false in cosimulation but return true in emulation.
In this case, invoke your component a few extra times from the testbench to guarantee that it consumes all data in the stream. These extra invocations should not cause functional problems because tryRead() simply returns false when no data is available.
7.2. Component Gets Bad Quality of Results
The information in this section describes some common sources of stallable arbitration nodes or excess RAM utilization.
Component Uses More FPGA Resource Than Expected
By default, the Intel® HLS Compiler Standard Edition optimizes your component for the best throughput by trying to maximize the maximum operating frequency (fMAX).
One way to reduce area consumption is to relax the fMAX requirement by setting a target fMAX value with the --clock i++ command option. The HLS compiler can often achieve a higher fMAX than you specify, so when you set the target fMAX to a value lower than you need, your design might still achieve an acceptable fMAX while consuming less area.
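For example, a build command might look like the following sketch (the source file and output names are illustrative, and your design likely needs additional i++ options; see the i++ command option reference for the exact --clock syntax):

```shell
# Relax the scheduler's fMAX target to trade speed for area.
# The file names here are illustrative; add your usual i++ options.
i++ --clock 120MHz component.cpp -o component-fpga
```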
Incorrect Bank Bits
If you access parts of an array in parallel (either a single- or multidimensional array), you might need to configure the memory bank selection bits.
See Memory Architecture Best Practices for details about how to configure efficient memory systems.
Conditional Operator Accessing Two Different Arrays of struct Variables
In some cases, if you try to access different arrays of struct variables with a conditional operator, the Intel® HLS Compiler merges the arrays into the same RAM block. You might see stallable arbitration in the Component Memory Viewer because there are not enough load/store sites on the memory system.
struct MyStruct {
    float a;
    float b;
};

MyStruct array1[64];
MyStruct array2[64];
MyStruct value = (shouldChooseArray1) ? array1[idx] : array2[idx];
You can avoid merging the arrays, and the resulting stallable arbitration, by replacing the conditional operator with an explicit if statement:

MyStruct value;
if (shouldChooseArray1) {
    value = array1[idx];
} else {
    value = array2[idx];
}
File-Scoped Static Variables
The Intel® HLS Compiler Standard Edition supports file-scoped static variables, but any memory attributes that you apply to static arrays work only if the static array is declared within the component function. Memory attributes applied to file-scoped static variables are ignored. Memory attributes applied to a variable are also ignored if you attempt to apply the attributes to array members in a struct or class definition.
If you want to override the default memory settings for an array variable, ensure that the array variable is declared in the scope of the component function where the array variable is used. You can pass pointers to the static array to any subroutines that might access the static array.
This code change is shown in the following example. The code samples and high-level design report views that follow compare two implementations of a component that reads data from a stream into a local memory, then processes the data that is in that local memory.
In the first code example, the local memory is a file-scoped static variable. In the second code example, the local memory is a function-scoped static variable.
The second code example gets better QoR because you can apply memory optimization attributes to the static variable declaration. In this second example, the hls_memory and hls_numbanks(1) attributes force the static array into a single bank of on-chip RAM blocks.
hls_memory hls_numbanks(1)
static int myStaticArray[64];

void loadData(ihc::stream_in<int> &intStreamIn) {
    for (int idx = 0; idx < 64; idx++) {
        myStaticArray[idx] = intStreamIn.read();
    }
}

int findMax() {
    int maxVal = 0;
    for (int idx = 0; idx < 64; idx++) {
        int val = myStaticArray[idx];
        if (val > maxVal) {
            maxVal = val;
        }
    }
    return maxVal;
}

component int dut(ihc::stream_in<int> &intStreamIn) {
    loadData(intStreamIn);
    return findMax();
}

void loadData(ihc::stream_in<int> &intStreamIn, int myStaticArray[64]) {
    for (int idx = 0; idx < 64; idx++) {
        myStaticArray[idx] = intStreamIn.read();
    }
}

int findMax(int myStaticArray[64]) {
    int maxVal = 0;
    for (int idx = 0; idx < 64; idx++) {
        int val = myStaticArray[idx];
        if (val > maxVal) {
            maxVal = val;
        }
    }
    return maxVal;
}

component int dut(ihc::stream_in<int> &intStreamIn) {
    hls_memory hls_numbanks(1)
    static int myStaticArray[64];

    loadData(intStreamIn, myStaticArray);
    return findMax(myStaticArray);
}

Cluster Logic
Your design might consume more RAM blocks than you expect, especially if you store many array variables in large registers. The Area Analysis of System report in the high-level design report (report.html) can help find this issue.

The three matrices are stored intentionally in RAM blocks, but the RAM blocks for the matrices account for less than half of the RAM blocks consumed by the component.
If you look further down the report, you might see that many RAM blocks are consumed by Cluster logic or State variable. You might also see that some of your array values that you intended to be stored in registers were instead stored in large numbers of RAM blocks.

Notice the number of RAM blocks that are consumed by Cluster Logic and State.
To reduce the number of RAM blocks that your component consumes, try the following techniques:
- Pipeline loops instead of unrolling them.
- Store local variables in local RAM blocks (hls_memory memory attribute) instead of large registers (hls_register memory attribute).
A. Intel HLS Compiler Standard Edition Best Practices Guide Archives
Intel® HLS Compiler Version | Title |
---|---|
19.1 | Intel® HLS Compiler Standard Edition Best Practices Guide |
18.1.1 | Intel® HLS Compiler Best Practices Guide |
18.1 | Intel® HLS Compiler Best Practices Guide |
18.0 | Intel® HLS Compiler Best Practices Guide |
17.1.1 | Intel® HLS Compiler Best Practices Guide |
17.1 | Intel® HLS Compiler Best Practices Guide |
B. Document Revision History for Intel HLS Compiler Standard Edition Best Practices Guide
Document Version | Intel® HLS Compiler Standard Edition Version | Changes |
---|---|---|
2019.12.18 | 19.1 | |
Document Revision History for Intel® HLS Compiler Best Practices Guide
Previous versions of the Intel® HLS Compiler Best Practices Guide contained information for both Intel® HLS Compiler Standard Edition and Intel® HLS Compiler Pro Edition.
Document Version | Intel® Quartus® Prime Version | Changes |
---|---|---|
2019.09.30 | 19.3 | |
2019.07.01 | 19.2 | |
2019.04.01 | 19.1 | |
2018.12.24 | 18.1 | |
2018.09.24 | 18.1 | |
2018.07.02 | 18.0 | |
2018.05.07 | 18.0 | |
2017.12.22 | 17.1.1 | |
2017.11.06 | 17.1 | Initial release. Parts of this book consist of content previously found in the Intel® High Level Synthesis Compiler User Guide and the Intel® High Level Synthesis Compiler Reference Manual. |