Intel High Level Synthesis Compiler Pro Edition: Best Practices Guide
Version Information
Updated for: Intel® Quartus® Prime Design Suite 20.4
1. Intel HLS Compiler Pro Edition Best Practices Guide
In this publication, <quartus_installdir> refers to the location where you installed Intel® Quartus® Prime Design Suite.
- Windows
- C:\intelFPGA_pro\20.4
- Linux
- /home/<username>/intelFPGA_pro/20.4
About the Intel® HLS Compiler Pro Edition Documentation Library
Title | Description |
---|---|
Release Notes | Provides late-breaking information about the Intel® HLS Compiler. |
Getting Started Guide | Get up and running with the Intel® HLS Compiler by learning how to initialize your compiler environment and reviewing the various design examples and tutorials provided with the Intel® HLS Compiler. |
User Guide | Provides instructions on synthesizing, verifying, and simulating intellectual property (IP) that you design for Intel FPGA products. Go through the entire development flow of your component from creating your component and testbench up to integrating your component IP into a larger system with the Intel Quartus Prime software. |
Reference Manual | Provides reference information about the features supported by the Intel HLS Compiler. Find details on Intel® HLS Compiler command options, header files, pragmas, attributes, macros, declarations, arguments, and template libraries. |
Best Practices Guide | Provides techniques and practices that you can apply to improve the FPGA area utilization and performance of your HLS component. Typically, you apply these best practices after you verify the functional correctness of your component. |
Quick Reference | Provides a brief summary of Intel HLS Compiler declarations and attributes on a single two-sided page. |
2. Best Practices for Coding and Compiling Your Component
- Understand FPGA Concepts
A key best practice to help you get the most out of the Intel® HLS Compiler is to understand important concepts about FPGAs. With an understanding of FPGA architecture and some FPGA hardware design concepts and methods, you can create better designs that take advantage of your target FPGA devices.
- Interface Best Practices
With the Intel® High Level Synthesis Compiler, your component can have a variety of interfaces: from basic wires to the Avalon Streaming and Avalon Memory-Mapped Master interfaces. Review the interface best practices to help you choose and configure the right interface for your component.
- Loop Best Practices
The Intel® High Level Synthesis Compiler pipelines your loops to enhance throughput. Review these loop best practices to learn techniques to optimize your loops to boost the performance of your component.
- Memory Architecture Best Practices
The Intel® High Level Synthesis Compiler infers efficient memory architectures (like memory width, number of banks, and ports) in a component by adapting the architecture to the memory access patterns of your component. Review the memory architecture best practices to learn how you can get the best memory architecture for your component from the compiler.
- System of Tasks Best Practices
Using a system of HLS tasks in your component enables a variety of design structures that you can implement, including executing multiple loops in parallel and sharing an expensive compute block.
- Datatype Best Practices
The datatypes in your component, and the possible conversions or casting that they might undergo, can significantly affect the performance and FPGA area usage of your component. Review the datatype best practices for tips and guidance on how best to control datatype sizes and conversions in your component.
- Alternative Algorithms
The Intel® High Level Synthesis Compiler lets you compile a component quickly to get initial insights into the performance and area utilization of your component. Take advantage of this speed to try larger algorithm changes to see how those changes affect your component performance.
3. FPGA Concepts
A key best practice to help you get the most out of the Intel® HLS Compiler is to understand important concepts about FPGAs. With an understanding of FPGA architecture and some FPGA hardware design concepts and methods, you can create better designs that take advantage of your target FPGA devices.
3.1. FPGA Architecture Overview
A field-programmable gate array (FPGA) is a reconfigurable semiconductor integrated circuit (IC).
FPGAs occupy a unique computational niche relative to other compute devices, such as central and graphics processing units (CPUs and GPUs), and custom accelerators, such as application-specific integrated circuits (ASICs). CPUs and GPUs have a fixed hardware structure to which a program maps, while ASICs and FPGAs can build custom hardware to implement a program.
While a custom ASIC generally outperforms an FPGA on a specific task, ASICs take significant time and money to develop. FPGAs are a cheaper off-the-shelf alternative that you can reprogram for each new application.
An FPGA is made up of a grid of configurable logic, known as adaptive logic modules (ALMs), and specialized blocks, such as digital signal processing (DSP) blocks and random-access memory (RAM) blocks. These programmable blocks are combined using configurable routing interconnects to implement complete digital circuits.
The total number of ALMs, DSP blocks, and RAM blocks used by a design is often referred to as the FPGA area or area that the design uses.
The following image illustrates a high-level architectural view of an FPGA:
3.1.1. Adaptive Logic Module (ALM)
The basic building block in an FPGA is an adaptive logic module (ALM).
A simplified ALM consists of a lookup table (LUT) and an output register from which the compiler can build any arbitrary Boolean logic circuit.
The following figure illustrates a simplified ALM:
3.1.1.1. Lookup Table (LUT)
A lookup table (LUT) that implements an arbitrary Boolean function of N inputs is often referred to as an N-LUT.
3.1.1.2. Register
A register is the most basic storage element in an FPGA. It has an input (in), an output (out), and a clock signal (clk). It is synchronous, that is, it synchronizes output changes to a clock. In an ALM, a register may store the output of the LUT.
The following figure illustrates a register:
The following figure illustrates the waveform of register signals:
The input data propagates to the output on every clock cycle. The output remains unchanged between clock cycles.
3.1.2. Digital Signal Processing (DSP) Block
A digital signal processing (DSP) block implements specific support for common fixed-point and floating-point arithmetic, which reduces the need to build equivalent logic from general-purpose ALMs.
The following figure illustrates a simplified three-input DSP block consisting of a multiplier (×) and an adder (+):
The following figure illustrates a simplified DSP block:
3.1.3. Random-Access Memory (RAM) Blocks
A random-access memory (RAM) block implements memory by using a high density of memory cells.
For more information, refer to Memory Types.
3.2. Concepts of FPGA Hardware Design
3.2.1. Maximum Frequency (fMAX)
The maximum clock frequency at which a digital circuit can operate is called its fMAX. The fMAX is the maximum rate at which the outputs of registers are updated.
The physical propagation delay of the signal across Boolean logic between two consecutive register stages limits the clock speed. This propagation delay is a function of the complexity of the combinational logic in the path.
The path with the most combinational logic elements (and the highest delay) limits the speed of the entire circuit. This speed limiting path is often referred to as the critical path.
The fMAX is calculated as the inverse of the critical path delay. A high fMAX is desirable because it results in high performance in the absence of other bottlenecks.
3.2.2. Latency
Latency is the measure of how long it takes to complete one or more operations in a digital circuit. You can measure latency at different granularities. For example, you can measure the latency of a single operation or the latency of the entire circuit.
You can measure latency in time (for example, microseconds) or in clock cycles. Typically, clock cycles are the preferred way to express latency because measuring latency in clock cycles disconnects latency from your circuit clock frequency. By expressing latency independent of circuit clock frequency, it is easier to discern the true impact of circuit changes to the performance of the circuit.
You may want to have low latency, but lowering latency might result in decreased fMAX.
For more information and an example, refer to Pipelining.
3.2.3. Pipelining
Pipelining is a design technique used in synchronous digital circuits to increase fMAX. Pipelining involves adding registers to the critical path, which decreases the amount of logic between each register. Less logic takes less time to execute, which enables an increase in fMAX.
The critical path in a circuit is the path between any two consecutive registers with the highest latency. That is, the path between two consecutive registers where the operations take the longest to complete.
Pipelining is especially useful when processing a stream of data. A pipelined circuit can have different stages of the pipeline operating on different input stream data in the same clock cycle, which leads to better data processing throughput.
Example
Consider a simple circuit with operations A and B on the critical path. If operation A takes 5 ns to complete and operation B takes 15 ns to complete, then the time delay on the critical path is 20 ns. This results in an fMAX of 50 MHz (1/max_delay).
If a pipeline register is added between A and B, the critical path changes. The delay on the critical path is now 15 ns, and pipelining this block results in an fMAX of 66.67 MHz because the maximum delay between two consecutive registers is now 15 ns.
While pipelining generally results in a higher fMAX, it increases latency. In the previous example, the latency of the block containing A and B increases from two to three clock cycles after pipelining.
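Expressed as a formula, fMAX = 1 / (critical path delay): here, 1 / 20 ns = 50 MHz before pipelining, and 1 / 15 ns ≈ 66.67 MHz after.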
3.2.4. Throughput
Throughput of a digital circuit is the rate at which data is processed.
In the absence of other bottlenecks, higher fMAX results in higher throughput (for example, samples/second).
Throughput is a good measure of the performance of a circuit, and throughput and performance are often used interchangeably when discussing a circuit.
3.2.5. Datapath
A datapath is a chain of registers and Boolean logic in a digital circuit that performs computations.
For example, the datapath in Pipelining consists of all of the elements shown, from the input register to the last output register.
In contrast, memory blocks are outside the datapath and reads and writes to memory are also considered to be outside of the datapath.
3.2.6. Control Path
While the datapath is the path on which computations occur, the control path is the path of signals that control the datapath circuitry.
- Handshaking flow control
Handshaking ensures that one part of your design is ready and able to accept data from another part of your design.
- Loop control
Loop control logic governs the flow of data through the hardware generated for the loops in your code, including any loop-carried dependencies.
- Branch control
Branch controls implement conditional statements in your code. Branch control can include parallelizing parts of conditional statements to improve performance.
The control path also consumes FPGA area, and the compiler uses techniques like clustering the datapath to help reduce the control path and save area. To learn about clustering, refer to Clustering the Datapath.
3.2.7. Occupancy
The occupancy of a datapath at a point in time refers to the proportion of the datapath that contains valid data.
The occupancy of a circuit over the execution of a program is the average occupancy over time from the moment the program starts to run until it has completed.
Unoccupied portions of the datapath are often referred to as bubbles. A bubble is analogous to a "no operation" (no-op) instruction on a CPU: it has no effect on the final output.
Decreasing bubbles increases occupancy. In the absence of other bottlenecks, maximizing occupancy of the datapath results in higher throughput.
3.3. Methods of Hardware Design
Traditionally, you program an FPGA using a hardware description language (HDL) such as Verilog or VHDL. However, a recent trend is to use higher-level languages.
Higher levels of abstraction can reduce the design time and increase the portability of your design.
The sections that follow discuss how Intel® HLS Compiler maps high-level languages to a hardware datapath.
3.3.1. How Source Code Becomes a Custom Hardware Datapath
Based on your source code and the following principles, the Intel® HLS Compiler builds a custom hardware datapath in the FPGA logic:
- The datapath is functionally equivalent to the C++ program described by the source. Once the datapath is constructed, the compiler orchestrates work items and loop iterations such that the hardware is effectively occupied.
- The compiler builds the custom hardware datapath while minimizing area of the FPGA resources (like ALMs and DSPs) used by the design.
3.3.1.1. Mapping Source Code Instructions to Hardware
For fixed architectures, such as CPUs and GPUs, a compiler compiles code into a set of instructions that run on functional units that have a fixed functionality. For these fixed architectures to be useful in a broad range of applications, some of their available functional units are not useful to every program. Unused functional units mean that your program does not fully occupy the fixed architecture hardware.
FPGAs are not subject to these restrictions of fixed functional units. On an FPGA, you can synthesize a specialized hardware datapath that can be fully occupied for an arbitrary set of instructions, which means you can be more efficient with the silicon area of your chip.
By implementing your algorithm in hardware, you can fill your chip with custom hardware that is always (or almost always) working on your problem instead of having idle functional units.
The Intel® HLS Compiler maps statements from the source code to individual specialized hardware operations, as shown in the example in the following image:
In general, each instruction maps to its own unique instance of a hardware operation. However, a single statement can map to more than one hardware operation, or multiple statements can combine into a single hardware operation when the compiler finds that it can generate hardware that is more efficient.
The latency of hardware operations is dependent on the complexity of the operation and the target fMAX.
The compiler takes these hardware operations and connects them into a graph based on their dependencies. When operations are independent, the compiler automatically infers parallelism by executing those operations simultaneously in time.
The following figure shows a dependency graph created for the hardware datapath. The dependency graph shows how the instruction is mapped to hardware operations and how the hardware operations are connected based on their dependencies. The loads in this example instruction are independent of each other and can therefore run simultaneously.
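As a representative example (the original figures are not reproduced here), a statement such as the following maps to two load operations, a multiply operation, and a store operation; the array names are placeholders:

c[i] = a[i] * b[i]; // the loads of a[i] and b[i] are independent and run simultaneously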
3.3.1.2. Mapping Arrays and Their Accesses to Hardware
Similar to the mapping of statements to specialized hardware operations, the compiler can map arrays (and structs) to hardware memories based on memory access patterns and variable sizes.
The datapath interacts with this memory through load/store units (LSUs), which are inferred from array accesses in the source code.
The following figure illustrates a simple example of mapping arrays and their accesses to hardware:
A RAM can have a limited number of read ports and write ports, but a datapath can have many LSUs. When the number of LSUs does not match the available number of read and write ports, the compiler uses techniques like replication, double pumping, sharing, and arbitration. For descriptions of these techniques, refer to Component Memory.
FPGAs provide specialized hardware block RAMs that you can configure and combine to match the size of your arrays. Customizing your memory configuration for your design can provide terabytes-per-second of on-chip memory bandwidth because each of these memories can interact with the datapath simultaneously.
Arrays might also be implemented in your component datapath. In this case, the array contents are stored as registers in the datapath when your algorithm is pipelined (as discussed in Pipelining). Storing array contents as registers in the datapath can improve performance in some cases, but it is a design decision whether to implement an array as registers or as memories.
When you access an array that is implemented as registers, LSUs are not used. The compiler might choose to use a select or a barrel shifter instead.
3.3.2. Scheduling
Scheduling refers to the process of determining the clock cycles at which each operation in the datapath executes.
Pipelining is the outcome of scheduling.
3.3.2.1. Dynamic Scheduling
The Intel® HLS Compiler generates pipelined datapaths that are dynamically scheduled.
A dynamically scheduled portion of the datapath does not pass data to its successor until its successor signals that it is ready to receive it.
This signaling is accomplished using handshaking control logic. For example, a variable latency load from memory may refuse to accept its predecessors' data until the load is complete.
Handshaking helps remove bubbles in the pipeline, which increases occupancy. For more information about bubbles, refer to Occupancy.
The following figure illustrates four regions of dynamically scheduled logic:
3.3.2.2. Clustering the Datapath
Dynamically scheduling all operations adds overhead in the form of additional FPGA area needed to implement the required handshaking control logic.
To reduce this overhead, the compiler groups fixed latency operations into clusters. A cluster of fixed latency operations, such as arithmetic operations, needs fewer handshaking interfaces, thereby reducing the area overhead.
If A, B, and C from Figure 4 do not contain variable latency operations, the compiler can cluster them together, as illustrated in Figure 5.
Clustering the logic reduces area by removing the stall signals and other handshaking logic within the cluster.
Cluster Types
The Intel® HLS Compiler can create the following types of clusters:
- Stall-Enable Cluster (SEC): This cluster type passes the handshaking logic to every pipeline stage in the cluster in parallel. If the cluster is stalled by logic from further down in the datapath, all logic in the SEC stalls at the same time.
Figure 6. Stall-Enable Cluster
- Stall-Free Cluster (SFC): This cluster type adds a first-in, first-out (FIFO) buffer to the end of the cluster that can accommodate at least the entire latency of the pipeline in the cluster. This FIFO is often called an exit FIFO because it is attached to the exit of the cluster datapath.
Because of this FIFO, the pipeline stages in the cluster do not require any handshaking logic. The stages can run freely and drain into the exit FIFO, even if the cluster is stalled from logic further down in the datapath.
Cluster Characteristics
The exit FIFO of the stall-free cluster results in some tradeoffs:
- Area: Because an SEC does not use an exit FIFO, it can save FPGA area compared to an SFC. If you have a design with many small, low-latency clusters, you can save a substantial amount of area by asking the compiler to use SECs instead of SFCs with the hls_use_stall_enable_clusters component attribute. For details, refer to hls_use_stall_enable_clusters Component Attribute in the Intel® HLS Compiler Reference Manual.
- Latency: Logic that uses SFCs might have a larger latency than logic that uses SECs because of the write-read latency of the exit FIFO. If you use a zero-latency FIFO for the exit FIFO, you can mitigate the latency, but fMAX or FPGA area use might be negatively impacted.
- fMAX: In an SFC, the oStall signal has less fanout than in an SEC. For a cluster with many pipeline stages, you can improve your design fMAX by using an SFC.
- Handshaking: The exit FIFO in SFCs allows them to take advantage of hyper-optimized handshaking between clusters. For more information, refer to Hyper-Optimized Handshaking. SECs do not support this capability.
- Bubble Handling: SECs remove only leading bubbles. A leading bubble is a bubble that arrives before the first piece of valid data arrives in the cluster. SECs do not remove bubbles that arrive after that. SFCs can use the exit FIFO to remove all bubbles from the pipeline if the SFC gets a downstream stall signal.
- Stall Behavior: When an SEC receives a downstream stall, all pipeline stages in the cluster stall at the same time. An SFC can continue to run and drain its in-flight data into the exit FIFO.
3.3.2.3. Handshaking Between Clusters
By default, the handshaking protocol between clusters is a simple stall/valid protocol. Data from the upstream cluster is consumed when the stall signal is low and the valid signal is high.
Hyper-Optimized Handshaking
If the distance across the FPGA between these two clusters is large, handshaking can become the critical path that limits peak fMAX in the design.
To improve these cases, the Intel® HLS Compiler can add pipelining registers to the stall/valid protocol to ease the critical path and improve fMAX. This enhanced handshaking protocol is called hyper-optimized handshaking.
The following timing diagram illustrates an example of upstream cluster 1 and downstream cluster 2 with two pipelining registers inserted in-between:
3.3.3. Mapping Parallelism Models to FPGA Hardware
This section describes how to map parallelism models to FPGA hardware:
3.3.3.1. Data Parallelism
Traditional instruction-set-architecture-based (ISA-based) accelerators, such as GPUs, derive data parallelism from vectorized instructions and by executing the same operation on multiple processing units.
In comparison, FPGAs derive their performance by taking advantage of their spatial architecture. FPGA compilers do not require you to vectorize your code. The compiler vectorizes your code automatically whenever it can.
3.3.3.1.1. Executing Independent Operations Simultaneously
As described in Mapping Source Code Instructions to Hardware, the compiler can automatically identify independent operations and execute them simultaneously in hardware.
This simultaneous execution of independent operations, combined with pipelining, is how performance through data parallelism is achieved on an FPGA.
The following image illustrates an example of an adder and a multiplier, which are scheduled to execute simultaneously while operating on separate inputs:
This automatic vectorization is analogous to how a superscalar processor takes advantage of instruction-level parallelism, but this vectorization happens statically at compile time instead of dynamically at runtime.
Because determining instruction-level parallelism occurs at compile time, there is no hardware or runtime cost of dependency checking for the generated hardware datapath. Additionally, the flexible logic and routing of an FPGA means that only the available resources (like ALMs and DSPs) of the FPGA restrict the number of independent operations that can occur simultaneously.
Unrolling Loops
You can unroll loops in the design by using loop pragmas. Loop unrolling decreases the number of iterations executed at the expense of increased hardware resource consumption, because multiple iterations of the loop execute simultaneously.
Once unrolled, the hardware resources are scheduled as described in Scheduling.
The Intel® HLS Compiler never attempts to unroll any loops in your source code automatically. You must always control loop unrolling by using the corresponding pragma. For details, refer to Loop Unrolling (unroll Pragma) in the Intel® High Level Synthesis Compiler Reference Manual .
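For illustration, a minimal sketch of the unroll pragma (the array data and accumulator sum are placeholders):

#pragma unroll
for (int i = 0; i < 4; i++) {
  // Fully unrolled: the compiler instantiates four copies of this
  // adder and schedules them to run concurrently.
  sum += data[i];
}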
Conditional Statements
The Intel® HLS Compiler attempts to eliminate conditional or branch statements as much as possible.
Conditionally executed code becomes predicated in the hardware. Predication increases the possibilities for executing operations simultaneously and achieving better performance. Additionally, removing branches allows the compiler to apply other optimizations to the design.
In this example, the function foo can be run unconditionally. The code that cannot be run unconditionally, like the memory assignments, retains a condition.
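The example referred to above accompanies a figure that is not reproduced here; the following hypothetical snippet illustrates the transformation:

// Original source: foo() is called on only one path.
// if (cond) { mem1[i] = foo(a); } else { mem2[i] = foo(b); }

int v1 = foo(a); // foo() has no side effects, so both calls run
int v2 = foo(b); // unconditionally in the generated hardware
if (cond) {
  mem1[i] = v1; // only the memory assignments retain a condition
} else {
  mem2[i] = v2;
}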
3.3.3.1.2. Pipelining
Similar to the implementation of a CPU with multiple pipeline stages, the compiler generates a deeply-pipelined hardware datapath. For more information, refer to Concepts of FPGA Hardware Design and How Source Code Becomes a Custom Hardware Datapath.
Pipelining allows for many data items to be processed concurrently (in the same clock cycle) while making efficient use of the hardware in the datapath by keeping it occupied.
Pipelining and Vectorizing a Pipelined Datapath
Consider the following example of code mapping to hardware:
When this code runs on a CPU, multiple invocations of it are not pipelined: one invocation completes its output before inputs are passed to the next invocation of the code.
Understanding where the data you need to pipeline is coming from is key to achieving high performance designs on the FPGA. You can use the following sources of data to take advantage of pipelining:
- Components
- Loop iterations
Pipelining Loops Within a Component
Within a component, loops are the primary source of pipeline parallelism.
When the Intel® HLS Compiler pipelines a loop, it attempts to schedule the loop execution such that the next iteration of the loop enters the pipeline before the previous iteration has completed. This pipelining of loop iterations can lead to higher throughput.
The number of clock cycles between iterations of the loop is called the Initiation Interval (II).
For the highest performance, a loop iteration would start every clock cycle, which corresponds to an II of 1.
Data dependencies that are carried from one loop iteration to another can affect the ability to achieve II of 1. These dependencies are called loop-carried dependencies.
The II of a loop must be high enough to accommodate all loop-carried dependencies.
The Intel® HLS Compiler automatically identifies these dependencies and tries to build hardware to resolve them while minimizing the II, subject to the target fMAX.
Naively generating hardware for the code in Figure 17 results in two loads: one from memory b and one from memory c. Because the compiler knows that the access to c[i-1] was written to in the previous iteration, the load from c[i-1] can be optimized away.
The dependency on the value stored to c in the previous iteration is resolved in a single clock cycle, so an II of 1 is achieved for the loop even though the iterations are not independent.
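Figure 17 is not reproduced here, but the loop under discussion presumably has the following form:

for (int i = 1; i < N; i++) {
  // Loop-carried dependency: c[i-1] was stored by the previous
  // iteration, so the compiler forwards that value rather than
  // issuing a second load from memory c.
  c[i] = c[i - 1] + b[i];
}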
For additional information about pipelining loops, refer to Pipeline Loops.
When the Intel® HLS Compiler cannot initially achieve II of 1, it chooses from several optimization strategies:
- Interleaving: When a loop nest has an inner loop II that is greater than 1, the Intel® HLS Compiler can attempt to interleave iterations of the outer loop into iterations of the inner loop to better utilize the hardware resources and achieve higher throughput.
Figure 18. Interleaving
For additional information about controlling interleaving in your component, refer to Loop Interleaving Control (max_interleaving Pragma) in the Intel® High Level Synthesis Compiler Reference Manual.
- Speculative Execution: When the critical path that affects II is the computation of the exit condition and not a loop-carried dependency, the Intel® HLS Compiler can attempt to relax this scheduling constraint by speculatively continuing to execute iterations of the loop while the exit condition is being computed.
If it is determined that the exit condition is satisfied, the effects of these extra iterations are suppressed.
This speculative execution can achieve a lower II and higher throughput, but it can incur additional overhead between loop invocations (equivalent to the number of speculated iterations). A larger loop trip count helps to minimize this overhead.
- Terminology Reminder
- A loop invocation is what starts a series of loop iterations. One loop iteration is one execution of the body of a loop.
Figure 19. Loop Orchestration Without Speculative Execution
Figure 20. Loop Orchestration With Speculative Execution
For additional information about speculation, refer to Loop Iteration Speculation (speculated_iterations Pragma) in the Intel® High Level Synthesis Compiler Reference Manual .
The Intel® HLS Compiler applies these optimizations automatically, and you can additionally control them through pragma statements in the design.
Pipelining Across Component Invocations
The pipelining of work across component invocations is similar to how loops are pipelined.
The Intel® HLS Compiler attempts to schedule the execution of component invocations such that the next invocation of a component enters the pipeline before the previous invocation has completed.
3.3.3.2. Task Parallelism
The compiler achieves concurrency by scheduling independent individual operations to execute simultaneously, but it does not achieve concurrency at coarser granularities (for example, across loops).
For larger code structures to execute in parallel with each other, you must write them as separate components or tasks that launch simultaneously. These components or tasks then run independently, and synchronize and communicate using pipes or streams, as shown in the following figure:
For details, see Systems of Tasks in the Intel® High Level Synthesis Compiler Pro Edition Reference Manual .
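As a rough sketch of this structure (the task function produce, the stream q, and the loop bounds are hypothetical; see the Reference Manual for the exact API):

#include <HLS/hls.h>
#include <HLS/task.h>

ihc::stream<int> q; // stream connecting the two tasks

void produce() {
  for (int i = 0; i < 16; i++) {
    q.write(i * i); // first loop, running as an independent task
  }
}

component int consume() {
  ihc::launch<produce>(); // start the task; it runs concurrently
  int sum = 0;
  for (int i = 0; i < 16; i++) {
    sum += q.read(); // second loop overlaps with produce()
  }
  ihc::collect<produce>(); // wait for the launched task to finish
  return sum;
}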
3.3.4. Memory Types
-
Component Memory
Component memory is memory allocated from memory resources (such as RAM blocks) available on the FPGA.
-
External Memory
External memory is memory resources that are outside of the FPGA.
3.3.4.1. Component Memory
If you declare an array inside your component, the Intel® HLS Compiler creates component memory in hardware. Component memory is sometimes referred to as local memory or on-chip memory because it is created from memory resources (such as RAM blocks) available on the FPGA.
The following source code snippet results in the creation of a component memory system, an interface to an external memory system, and access to these memory systems:
#include <HLS/hls.h>

constexpr int SIZE = 128;
constexpr int N = SIZE - 1;

using MasterInterface = ihc::mm_master<int, ihc::waitrequest<true>, ihc::latency<0>>;

component void memoryComponent(MasterInterface &masterA) {
  hls_memory int T[SIZE]; // declaring an array as a component memory
  for (unsigned i = 0; i < SIZE; i++) {
    T[i] = i; // writing to component memory
  }
  for (int i = 0; i < N; i += 2) {
    // reading from a component memory and writing to an external
    // Avalon memory-mapped slave component through an Avalon
    // memory-mapped master interface
    masterA[i] = T[i] + T[i + 1];
  }
}
The compiler performs the following tasks to build a memory system:
- Builds a component memory from FPGA memory resources (such as block RAMs) and presents it to the datapath as a single memory.
- Maps each array access to a load-store unit (LSU) in the datapath that transacts with the component memory through its ports.
- Automatically optimizes the component memory geometry to maximize the bandwidth available to loads and stores in the datapath.
- Attempts to guarantee that component memory accesses never stall.
Stallable and Stall-Free Memory Systems
- Stall-free memory access
- A memory access is stall-free if it has contention-free access to a memory port. A memory system is stall-free if each of its memory operations has contention-free access to a memory port.
- Stallable memory access
- A memory access is stallable if it does not have contention-free access to a memory port. When two datapath LSUs try to transact with a memory port in the same clock cycle, one of those memory accesses is delayed (or stalled) until the memory port in contention becomes available.
As much as possible, the Intel® HLS Compiler tries to create stall-free memory systems for your component.
-
A: A stall-free memory system
This memory system is stall-free because, even though the reads are scheduled in the same cycle, they are mapped to different ports. There is no contention for accessing the memory ports.
-
B: A stall-free memory system
This memory system is stall-free because the two reads are statically-scheduled to occur in different clock cycles. The two reads can share a memory port without any contention for the read access.
-
C: A stallable memory system
This memory system is stallable because two reads are mapped to the same port in the same cycle. The two reads happen at the same time. These reads require collision arbitration to manage their port access requests, and arbitration can affect throughput.
- Port
- A memory port is a physical access point into a memory. A port is connected to one or more load-store units (LSUs) in the datapath, and an LSU can connect to one or more ports.
- Bank
- A memory bank is a division of the component memory system that contains a subset of the data stored. That is, all of the data stored for a component is split across banks, with each bank containing a unique piece of the stored data.
A memory system always has at least one bank.
- Replicate
- A memory bank replicate is a copy of the data in the memory bank with its own ports. All replicates in a bank contain the same data, and each replicate can be accessed independently of the others.
A memory bank always has at least one replicate.
- Private Copy
- A private copy is a copy of the data in a replicate that is created for nested loops to enable concurrent iterations of the outer loop.
A replicate can comprise multiple private copies, with each iteration of an outer loop having its own private copy. Because each outer loop iteration has its own private copy, private copies are not expected to all contain the same data.
The following figure illustrates the relationship between banks, replicates, ports, and private copies:
Strategies that Enable Concurrent Stall-Free Memory Accesses
The compiler uses a variety of strategies to ensure that concurrent accesses are stall-free, including:
- Adjusting the number of ports the memory system has. The compiler can do this either by replicating the memory to enable more read ports or by clocking the RAM block at twice the component clock speed, which enables four ports per replicate instead of two.
Clocking the RAM block at twice the component clock speed to double the number of available ports to the memory system is called double pumping.
All of a replicate's physical access ports can be accessed concurrently.
- Partitioning memory content into one or more banks, such that each bank contains a subset of the data contained in the original memory (corresponds to the top-right box of Schematic Representation of Local Memories Showing the Relationship between Banks, Replicates, Ports, and Private Copies).
The banks of a component memory can be accessed concurrently by the datapath.
- Replicating a bank to create multiple coherent replicates (corresponds to the bottom-left box of Schematic Representation of Local Memories Showing the Relationship between Banks, Replicates, Ports, and Private Copies). Each replicate in a bank contains identical data.
The replicates are loaded concurrently.
- Creating private copies of an array that is declared inside of a loop nest (corresponds to the bottom-right box of Schematic Representation of Local Memories Showing the Relationship between Banks, Replicates, Ports, and Private Copies).
These private copies enable loop pipelining because each pipeline-parallel loop iteration accesses its own private copy of the array declared within the loop body. Private copies are not expected to contain the same data.
Despite the compiler’s best efforts, the component memory system can still be stallable. This might happen due to resource constraints or memory attributes defined in your source code. In that case, the compiler tries to minimize the hardware resources consumed by the arbitrated memory system.
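The Reference Manual also documents memory attributes for requesting these structures explicitly. A minimal hedged sketch (the geometry of two banks, double pumping, and two replicates is arbitrary):

#include <HLS/hls.h>

component int banked_lookup(int i, int j) {
  // hls_numbanks(2): split the data across two banks.
  // hls_doublepump: clock the RAM at twice the component clock,
  // doubling the available ports per replicate.
  // hls_max_replicates(2): allow up to two coherent replicates.
  hls_memory hls_numbanks(2) hls_doublepump hls_max_replicates(2)
      static int table[512];
  table[i & 511] = i;    // store through one port
  return table[j & 511]; // load can be served by another port
}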
3.3.4.2. External Memory
If the component accesses memory outside of the component, the compiler creates a hardware interface through which the datapath accesses this external memory. The interface is described using a pointer or Avalon® memory-mapped master interface as a function argument to the component. One interface is created for every pointer or memory-mapped master interface component argument.
The code snippet in Component Memory shows an external memory described with an Avalon® memory-mapped master interface and its accesses within the component.
Unlike component memory, the compiler does not define the structure of the external memory. The compiler instantiates a specialized LSU for each access site based on the type of interface and the memory access patterns.
The compiler also tries various strategies to maximize the efficient use of the available memory interface bandwidth such as eliminating unnecessary accesses and statically coalescing contiguous accesses.
4. Interface Best Practices
With the Intel® High Level Synthesis Compiler, your component can have a variety of interfaces: from basic wires to the Avalon Streaming and Avalon Memory-Mapped Master interfaces. Review the interface best practices to help you choose and configure the right interface for your component.
Each interface type supported by the Intel® HLS Compiler Pro Edition has different benefits. However, the system that surrounds your component might limit your choices. Keep your requirements in mind when determining the optimal interface for your component.
Demonstrating Interface Best Practices
The Intel® HLS Compiler Pro Edition comes with a number of tutorials that illustrate important Intel® HLS Compiler concepts and demonstrate good coding practices.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
interfaces/overview | Demonstrates the effects on quality-of-results (QoR) of choosing different component interfaces even when the component algorithm remains the same. |
best_practices/const_global | Demonstrates the performance and resource utilization improvements of using const-qualified global variables. Also demonstrates the type of interface created when you access global variables. |
best_practices/parameter_aliasing | Demonstrates the use of the __restrict keyword on component arguments. |
best_practices/lsu_control | Demonstrates the effects of controlling the type of LSUs instantiated for variable-latency Avalon® Memory Mapped Master interfaces. |
interfaces/explicit_streams_buffer | Demonstrates how to use explicit stream_in and stream_out interfaces in the component and testbench. |
interfaces/explicit_streams_packets_empty | Demonstrates how to use the usesPackets, usesEmpty, and firstSymbolInHighOrderBits stream template parameters. |
interfaces/explicit_streams_packets_ready_valid | Demonstrates how to use the usesPackets, usesValid, and usesReady stream template parameters. |
interfaces/mm_master_testbench_operators | Demonstrates how to invoke a component at different indices of an Avalon Memory Mapped (MM) Master (mm_master class) interface. |
interfaces/mm_slaves | Demonstrates how to create Avalon-MM Slave interfaces (slave registers and slave memories). |
interfaces/mm_slaves_csr_volatile | Demonstrates the effect of using the volatile keyword to allow concurrent slave memory accesses while your component is running. |
interfaces/mm_slaves_double_buffering | Demonstrates the effect of using the hls_readwrite_mode macro to control how memory masters access the slave memories. |
interfaces/multiple_stream_call_sites | Demonstrates the tradeoffs of using multiple stream call sites. |
interfaces/pointer_mm_master | Demonstrates how to create Avalon-MM Master interfaces and control their parameters. |
interfaces/stable_arguments | Demonstrates how to use the stable attribute for unchanging arguments to improve resource utilization. |
4.1. Choose the Right Interface for Your Component
Different component interfaces can affect the quality of results (QoR) of your component without changing your component algorithm. Consider the effects of different interfaces before choosing the interface between your component and the rest of your design.
The best interface for your component might not be immediately apparent, so you might need to try different interfaces for your component to achieve the optimal QoR. Take advantage of the rapid component compilation time provided by the Intel® HLS Compiler Pro Edition and the resulting High Level Design reports to determine which interface gives you the optimal QoR for your component.
This section uses a vector addition example to illustrate the impact of changing the component interface while keeping the component algorithm the same. The example has two input vectors, vector a and vector b, and stores the result to vector c. The vectors have a length of N (which could be very large).
#pragma unroll 8
for (int i = 0; i < N; ++i) {
  c[i] = a[i] + b[i];
}
The Intel® HLS Compiler Pro Edition extracts the parallelism of this algorithm by pipelining the loops if no loop dependency exists. In addition, by unrolling the loop (by a factor of 8), more parallelism can be extracted.
Ideally, the generated component has a latency of N/8 cycles. In the examples in the following section, a value of 1024 is used for N, so the ideal latency is 128 cycles (1024/8).
The following sections present variations of this example that use different interfaces. Review these sections to learn how different interfaces affect the QoR of this component.
You can work your way through the variations of these examples by reviewing the tutorial available in <quartus_installdir>/hls/examples/tutorials/interfaces/overview.
4.1.1. Pointer Interfaces
Pointers in a component are implemented as Avalon® Memory Mapped (Avalon®-MM) master interfaces with default settings. For more details about pointer parameter interfaces, see Intel HLS Compiler Default Interfaces in the Intel® High Level Synthesis Compiler Pro Edition Reference Manual.
component void vector_add(int* a, int* b, int* c, int N) {
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}

The following Loop Analysis report shows that the component has an undesirably high loop initiation interval (II). The II is high because vectors a, b, and c are all accessed through the same Avalon-MM Master interface. The Intel® HLS Compiler Pro Edition uses stallable arbitration logic to schedule these accesses, which results in poor performance and high FPGA area use.
In addition, the compiler cannot assume there are no data dependencies between loop iterations because pointer aliasing might exist. The compiler cannot determine that vectors a, b, and c do not overlap. If data dependencies exist, the Intel® HLS Compiler cannot pipeline the loop iterations effectively.

QoR Metric | Value |
---|---|
ALMs | 15593.5 |
DSPs | 0 |
RAMs | 30 |
fMAX (MHz)2 | 298.6 |
Latency (cycles) | 24071 |
Initiation Interval (II) (cycles) | ~508 |
1The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1. |
2The fMAX measurement was calculated from a single seed. |
4.1.2. Avalon Memory Mapped Master Interfaces
By default, pointers in a component are implemented as Avalon® Memory Mapped (Avalon® MM) master interfaces with default settings. You can mitigate the poor performance of the default settings by configuring the Avalon® MM master interfaces.
You can configure the Avalon® MM master interface for the vector addition component example using the ihc::mm_master class as follows:
component void vector_add(
    ihc::mm_master<int, ihc::aspace<1>,
                   ihc::dwidth<8*8*sizeof(int)>,
                   ihc::align<8*sizeof(int)>>& a,
    ihc::mm_master<int, ihc::aspace<2>,
                   ihc::dwidth<8*8*sizeof(int)>,
                   ihc::align<8*sizeof(int)>>& b,
    ihc::mm_master<int, ihc::aspace<3>,
                   ihc::dwidth<8*8*sizeof(int)>,
                   ihc::align<8*sizeof(int)>>& c,
    int N) {
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}
- The vectors are each assigned to different address spaces with the ihc::aspace attribute, and each vector receives a separate Avalon® MM master interface.
With the vectors assigned to different physical interfaces, the vectors can be accessed concurrently without interfering with each other, so memory arbitration is not needed.
- The width of the interfaces for the vectors is adjusted with the ihc::dwidth attribute.
- The alignment of the interfaces for the vectors is adjusted with the ihc::align attribute.

The diagram shows that vector_add.B2 has two loads and one store. The default Avalon® MM Master settings used by the code example in Pointer Interfaces had 16 loads and 8 stores.
By expanding the width and alignment of the vector interfaces, the original pointer interface loads and stores were coalesced into one wide load each for vector a and vector b, and one wide store for vector c.
Also, the memories are stall-free because the loads and stores in this example access separate memories.
QoR Metric | Pointer | Avalon MM Master |
---|---|---|
ALMs | 15593.5 | 643 |
DSPs | 0 | 0 |
RAMs | 30 | 0 |
fMAX (MHz)2 | 298.6 | 472.37 |
Latency (cycles) | 24071 | 142 |
Initiation Interval (II) (cycles) | ~508 | 1 |
1The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1. |
2The fMAX measurement was calculated from a single seed. |
4.1.3. Avalon Memory Mapped Slave Interfaces
When you allocate a slave memory, you must define its size. Defining the size puts a limit on how large a value of N the component can process. In this example, the RAM size is 1024 words, which means that N can have a maximum value of 1024.
component void vector_add(
    hls_avalon_slave_memory_argument(1024*sizeof(int)) int* a,
    hls_avalon_slave_memory_argument(1024*sizeof(int)) int* b,
    hls_avalon_slave_memory_argument(1024*sizeof(int)) int* c,
    int N) {
  #pragma unroll 8
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}

QoR Metric | Pointer | Avalon® MM Master | Avalon® MM Slave |
---|---|---|---|
ALMs | 15593.5 | 643 | 490.5 |
DSPs | 0 | 0 | 0 |
RAMs | 30 | 0 | 48 |
fMAX (MHz)2 | 298.6 | 472.37 | 498.26 |
Latency (cycles) | 24071 | 142 | 139 |
Initiation Interval (II) (cycles) | ~508 | 1 | 1 |
1The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1. |
2The fMAX measurement was calculated from a single seed. |
4.1.4. Avalon Streaming Interfaces
Avalon® Streaming (Avalon® ST) interfaces support a unidirectional flow of data and are typically used for components that drive high-bandwidth, low-latency data.
struct int_v8 {
  int data[8];
};

component void vector_add(
    ihc::stream_in<int_v8>& a,
    ihc::stream_in<int_v8>& b,
    ihc::stream_out<int_v8>& c,
    int N) {
  for (int j = 0; j < (N/8); ++j) {
    int_v8 av = a.read();
    int_v8 bv = b.read();
    int_v8 cv;
    #pragma unroll 8
    for (int i = 0; i < 8; ++i) {
      cv.data[i] = av.data[i] + bv.data[i];
    }
    c.write(cv);
  }
}
An Avalon® ST interface has a data bus and ready and valid signals for handshaking. The struct is created to pack eight integers so that eight operations at a time can occur in parallel to provide a comparison with the examples for other interfaces. Similarly, the loop count is divided by eight.

The streaming interfaces are stallable from the upstream sources and the downstream output. Because the interfaces are stallable, the loop initiation interval (II) is approximately 1 (instead of exactly 1). If the component does not receive any bubbles (gaps in data flow) from upstream or stall signals from downstream, then the component achieves the desired II of 1.
If you know that the stream interfaces will never stall, you can further optimize this component by taking advantage of the usesReady and usesValid stream parameters.
QoR Metric | Pointer | Avalon® MM Master | Avalon® MM Slave | Avalon® ST |
---|---|---|---|---|
ALMs | 15593.5 | 643 | 490.5 | 314.5 |
DSPs | 0 | 0 | 0 | 0 |
RAMs | 30 | 0 | 48 | 0 |
fMAX (MHz)2 | 298.6 | 472.37 | 498.26 | 389.71 |
Latency (cycles) | 24071 | 142 | 139 | 134 |
Initiation Interval (II) (cycles) | ~508 | 1 | 1 | 1 |
1The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1. |
2The fMAX measurement was calculated from a single seed. |
4.1.5. Pass-by-Value Interface
For software developers accustomed to writing code that targets a CPU, passing each element in an array by value might be unintuitive because it typically results in many function calls or large parameters. However, for code that targets an FPGA device, passing array elements by value can result in smaller and simpler hardware on the FPGA device.
The vector addition example can be coded to pass the vector array elements by value, as follows. A struct is used to pass the entire array (of 8 data elements) by value. To implement the struct efficiently, do the following:
- Define element-wise copy constructors.
- Define element-wise copy assignment operators.
- Add the hls_register memory attribute to all struct members in the definition.
struct int_v8 {
  hls_register int data[8];

  // copy assignment operator
  int_v8 operator=(const int_v8& org) {
    #pragma unroll
    for (int i = 0; i < 8; i++) {
      data[i] = org.data[i];
    }
    return *this;
  }

  // copy constructor
  int_v8(const int_v8& org) {
    #pragma unroll
    for (int i = 0; i < 8; i++) {
      data[i] = org.data[i];
    }
  }

  // default constructor & destructor
  int_v8() {};
  ~int_v8() {};
};

component int_v8 vector_add(int_v8 a, int_v8 b) {
  int_v8 c;
  #pragma unroll 8
  for (int i = 0; i < 8; ++i) {
    c.data[i] = a.data[i] + b.data[i];
  }
  return c;
}
This component takes and processes only eight elements of vector a and vector b, and returns eight elements of vector c. To compute 1024 elements for the example, the component needs to be called 128 times (1024/8). While in previous examples the component contained loops that were pipelined, here the component is invoked many times, and each of the invocations is pipelined.

QoR Metric | Pointer | Avalon® MM Master | Avalon® MM Slave | Avalon® ST | Pass-by-Value |
---|---|---|---|---|---|
ALMs | 15593.5 | 643 | 490.5 | 314.5 | 130 |
DSPs | 0 | 0 | 0 | 0 | 0 |
RAMs | 30 | 0 | 48 | 0 | 0 |
fMAX (MHz)2 | 298.6 | 472.37 | 498.26 | 389.71 | 581.06 |
Latency (cycles) | 24071 | 142 | 139 | 134 | 128 |
Initiation Interval (II) (cycles) | ~508 | 1 | 1 | 1 | 1 |
1The compilation flow used to calculate the QoR metrics used Intel® Quartus® Prime Pro Edition Version 17.1. |
2The fMAX measurement was calculated from a single seed. |
4.2. Control LSUs For Your Variable-Latency MM Master Interfaces
Controlling the type of load-store units (LSUs) that the Intel® HLS Compiler Pro Edition uses to interact with variable-latency Memory Mapped (MM) Master interfaces can help save area in your design. You might also encounter situations where disabling static coalescing of a load/store with other load/store operations benefits the performance of your design.
Review the following tutorial to learn about controlling LSUs: <quartus_installdir>/hls/examples/tutorials/best_practices/lsu_control.
To see if you need to use LSU controls, review the High-Level Design Reports for your component, especially the Function Memory Viewer, to see if the memory access pattern (and its associated LSUs) inferred by the Intel® HLS Compiler Pro Edition match your expected memory access pattern. If they do not match, consider controlling the LSU type, LSU coalescing, or both.
Control the Type of LSU Created
The Intel® HLS Compiler Pro Edition creates either burst-coalesced LSUs or pipelined LSUs.
In general, use burst-coalesced LSUs when an LSU is expected to process many load/store requests to memory words that are consecutive. The burst-coalesced LSU attempts to "dynamically coalesce" the requests into larger bursts in order to utilize memory bandwidth more efficiently.
The pipelined LSU consumes significantly less FPGA area, but it processes load/store requests individually without any coalescing. A pipelined LSU is useful when your design is tight on area or when the accesses to the variable-latency MM Master interface are not necessarily consecutive.
component void dut(
    mm_master<int, dwidth<128>, awidth<32>, aspace<4>, latency<0>> &Buff1,
    mm_master<int, dwidth<32>, awidth<32>, aspace<5>, latency<0>> &Buff2) {
  int Temp[SIZE];
  using pipelined = lsu<style<PIPELINED>>;
  using burst_coalesced = lsu<style<BURST_COALESCED>>;
  for (int i = 0; i < SIZE; i++) {
    Temp[i] = burst_coalesced::load(&Buff1[i]); // Burst-coalesced LSU
  }
  for (int i = 0; i < SIZE; i++) {
    pipelined::store(&Buff2[i], 2 * Temp[i]); // Pipelined LSU
  }
}
Disable Static Coalescing
Static coalescing is typically beneficial because it reduces the total number of LSUs in your design by statically combining multiple load/store operations into wider load/store operations.
However, there are cases where static coalescing leads to unaligned accesses, which you might not want to occur. There are also cases where multiple loads/stores get coalesced even though you intended for only a subset of them to be operational at a time. In these cases, consider disabling static coalescing for the load/store operations that you did not want to be coalesced.
component int dut(
    mm_master<int, dwidth<256>, awidth<32>, aspace<1>, latency<0>> &Buff1,
    int i, bool Cond1, bool Cond2) {
  using no_coalescing = lsu<style<PIPELINED>, static_coalescing<false>>;
  int Val = 0;
  if (Cond1) {
    Val = no_coalescing::load(&Buff1[i]);
  }
  if (Cond2) {
    Val = no_coalescing::load(&Buff1[i + 1]);
  }
  return Val;
}
4.3. Avoid Pointer Aliasing
Add a restrict type-qualifier to pointer types whenever possible. By having restrict-qualified pointers, you prevent the Intel® HLS Compiler Pro Edition from creating unnecessary memory dependencies between nonconflicting read and write operations.
The restrict type-qualifier is __restrict.
Consider a loop where each iteration reads data from one array, and then it writes data to another array in the same physical memory. Without adding the restrict type-qualifier to these pointer arguments, the compiler must assume that the two arrays might overlap. Therefore, the compiler must keep the original order of memory accesses to both arrays, resulting in poor loop optimization or even failure to pipeline the loop that contains the memory accesses.
You can also use the restrict type-qualifier with Avalon® memory-mapped (MM) master interfaces.
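A minimal sketch of restrict-qualified pointer arguments (the component name and arguments are illustrative):

component void copy_add_one(int* __restrict in, int* __restrict out, int N) {
  for (int i = 0; i < N; ++i) {
    // Because in and out are declared non-aliasing, the compiler
    // does not serialize the load from in[] against the store to
    // out[], and the loop pipelines with a low II.
    out[i] = in[i] + 1;
  }
}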
For a demonstration, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/parameter_aliasing
5. Loop Best Practices
The Intel® HLS Compiler Pro Edition lets you know if there are any dependencies that prevent it from optimizing your loops. Try to eliminate these dependencies in your code for optimal component performance. You can also provide additional guidance to the compiler by using the available loop pragmas.
- Manually fuse adjacent loop bodies when the instructions in those loop bodies can be performed in parallel. These fused loops can be pipelined instead of being executed sequentially. Pipelining reduces the latency of your component and can reduce the FPGA area your component uses.
- Use the #pragma loop_coalesce directive to have the compiler attempt to collapse nested loops; a brief sketch follows this list. Coalescing loops reduces the latency of your component and can reduce the FPGA area overhead needed for nested loops.
- If you have two loops that can execute in parallel, consider using a system of tasks. For details, see System of Tasks Best Practices.
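As a brief sketch of the loop_coalesce directive mentioned above (the bounds, the matrix m, and the accumulator sum are placeholders):

int m[32][32] = {};
int sum = 0;
// Ask the compiler to collapse the two nested loops into a single
// hardware loop, removing the inner loop's control overhead.
#pragma loop_coalesce 2
for (int i = 0; i < 32; i++) {
  for (int j = 0; j < 32; j++) {
    sum += m[i][j];
  }
}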
Tutorials Demonstrating Loop Best Practices
The Intel® HLS Compiler Pro Edition comes with a number of tutorials that illustrate important Intel® HLS Compiler concepts and demonstrate good coding practices.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
best_practices/divergent_loops | Demonstrates a source-level optimization for designs with divergent loops. |
best_practices/loop_coalesce | Demonstrates the performance and resource utilization improvements of using the loop_coalesce pragma on nested loops. |
best_practices/loop_fusion | Demonstrates the latency and resource utilization improvements of loop fusion. |
best_practices/loop_memory_dependency | Demonstrates breaking loop-carried dependencies using the ivdep pragma. |
loop_controls/max_interleaving | Demonstrates a method to reduce the area utilization of a loop that meets certain conditions (described in the tutorial). |
best_practices/optimize_ii_using_hls_register | Demonstrates how to use the hls_register attribute to reduce loop II and how to use hls_max_concurrency to improve component throughput. |
best_practices/parallelize_array_operation | Demonstrates how to improve fMAX by correcting a bottleneck that arises when performing operations on an array in a loop. |
best_practices/relax_reduction_dependency | Demonstrates a method to reduce the II of a loop that includes a floating-point accumulator, or other reduction operation that cannot be computed at high speed in a single clock cycle. |
best_practices/remove_loop_carried_dependency | Demonstrates how to improve loop performance by removing accesses to the same variable across nested loops. |
best_practices/resource_sharing_filter | Demonstrates several versions of a 32-tap finite impulse response (FIR) filter design. |
best_practices/triangular_loop | Demonstrates a method for describing triangular loop patterns with dependencies. |
5.1. Reuse Hardware By Calling It In a Loop
Loops are a useful way to reuse hardware. If your component function calls another function, the called function is in-lined into the caller, so calling a function multiple times results in hardware duplication. For example, the following code creates three copies of the hardware for foo:
int foo(int a) { return 4 + sqrt(a); }

component void myComponent() {
  ...
  int x = 0;
  x += foo(0);
  x += foo(1);
  x += foo(2);
  ...
}
If you instead call foo from a loop with an unroll factor of 1, the compiler generates one copy of the hardware for foo and reuses it on each iteration:

component void myComponent() {
  ...
  int x = 0;
  #pragma unroll 1
  for (int i = 0; i < 3; i++) {
    x += foo(i);
  }
  ...
}
If the function arguments do not follow a simple pattern, you can select them with a switch statement inside the loop:

component void myComponent() {
  ...
  int x = 0;
  #pragma unroll 1
  for (int i = 0; i < 3; i++) {
    int val = 0;
    switch(i) {
      case 0: val = 3; break;
      case 1: val = 6; break;
      case 2: val = 1; break;
    }
    x += foo(val);
  }
  ...
}
You can learn more about reusing hardware and minimizing inlining by reviewing the resource sharing tutorial available in <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter.
5.2. Parallelize Loops
You can take advantage of the spatial compute structure to accelerate the loops by having multiple iterations of a loop executing concurrently. To have multiple iterations of a loop execute concurrently, unroll loops when possible and structure your loops so that dependencies between loop iterations are minimized and can be resolved within one clock cycle.
These practices show how to parallelize different iterations of the same loop. If you have two different loops that you want to parallelize, consider using a system of tasks. For details, see System of Tasks Best Practices.
5.2.1. Pipeline Loops


This loop is pipelined with a loop initiation interval (II) of 1. An II of 1 means that there is a delay of 1 clock cycle between starting each successive loop iteration.
The Intel® HLS Compiler Pro Edition attempts to pipeline loops by default, and loop pipelining is not subject to the same constant iteration count constraint that loop unrolling is.
Not all loops can be pipelined as well as the loop shown in Figure 30, particularly loops where each iteration depends on a value computed in a previous iteration.
For example, consider if Stage 1 of the loop depended on a value computed during Stage 3 of the previous loop iteration. In that case, the second (orange) iteration could not start executing until the first (blue) iteration had reached Stage 3. This type of dependency is called a loop-carried dependency.
In this example, the loop would be pipelined with II=3. Because the II is the same as the latency of a loop iteration, the loop would not actually be pipelined at all. You can estimate the overall latency of a loop with the following equation:

latency_loop = (iterations − 1) × II + latency_iteration

where latency_loop is the number of cycles the loop takes to execute and latency_iteration is the number of cycles a single loop iteration takes to execute.
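For example, a loop with 100 iterations, an II of 2, and an iteration latency of 10 cycles takes approximately (100 − 1) × 2 + 10 = 208 cycles to complete.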
The Intel® HLS Compiler Pro Edition supports pipelining nested loops without unrolling inner loops. When calculating the latency of nested loops, apply this formula recursively. This recursion means that having II>1 is more problematic for inner loops than for outer loops. Therefore, algorithms that do most of their work on an inner loop with II=1 still perform well, even if their outer loops have II>1.
5.2.2. Unroll Loops


You can control how the compiler unrolls a loop with the #pragma unroll directive, but this directive works only if the compiler knows the trip count for the loop in advance or if you specify the unroll factor. In addition to replicating the hardware, the compiler also reschedules the circuit such that each operation runs as soon as the inputs for the operation are ready.
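A minimal sketch of the directive (assuming acc and data are declared earlier and the trip count of 16 is known at compile time) unrolls a loop by a factor of 4:

#pragma unroll 4
for (int i = 0; i < 16; i++) {
  acc += data[i];  // four copies of this operation run in parallel
}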
For an example of using the #pragma unroll directive, see the best_practices/resource_sharing_filter tutorial.
5.2.3. Example: Loop Pipelining and Unrolling
1.  #define ROWS 4
2.  #define COLS 4
3.
4.  component void dut(...) {
5.    float a_matrix[COLS][ROWS]; // store in column-major format
6.    float r_matrix[ROWS][COLS]; // store in row-major format
7.
8.    // setup...
9.
10.   for (int i = 0; i < COLS; i++) {
11.     for (int j = i + 1; j < COLS; j++) {
12.
13.       float dotProduct = 0;
14.       for (int mRow = 0; mRow < ROWS; mRow++) {
15.         dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
16.       }
17.       r_matrix[i][j] = dotProduct;
18.     }
19.   }
20.
21.   // continue...
22.
23. }
You can improve the performance of this component by unrolling the loops that iterate across each entry of a particular column. If the loop operations are independent, then the compiler executes them in parallel.
Floating-point operations typically must be carried out in the same order that they are expressed in your source code to preserve numerical precision. However, you can use the -ffp-contract=fast compiler flag to relax the ordering of floating-point operations. With the order of floating-point operations relaxed, all of the multiplications in this loop can occur in parallel. To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops
The compiler tries to unroll loops on its own when it thinks unrolling improves performance. For example, the loop at line 14 is automatically unrolled because the loop has a constant number of iterations, and does not consume much hardware (ROWS is a constant defined at compile-time, ensuring that this loop has a fixed number of iterations).
01: #define ROWS 4
02: #define COLS 4
03:
04: component void dut(...) {
05:   float a_matrix[COLS][ROWS]; // store in column-major format
06:   float r_matrix[ROWS][COLS]; // store in row-major format
07:
08:   // setup...
09:
10:   for (int i = 0; i < COLS; i++) {
11:
12:     #pragma unroll
13:     for (int j = 0; j < COLS; j++) {
14:       float dotProduct = 0;
15:
16:       #pragma unroll
17:       for (int mRow = 0; mRow < ROWS; mRow++) {
18:         dotProduct += a_matrix[i][mRow] * a_matrix[j][mRow];
19:       }
20:
21:       r_matrix[i][j] = (j > i) ? dotProduct : 0; // predication
22:     }
23:   }
24:
25:   // continue...
26:
27: }
Now the j-loop is fully unrolled. Because the iterations have no dependencies, all four of them run at the same time.
Refer to the resource_sharing_filter tutorial located at <quartus_installdir>/hls/examples/tutorials/best_practices/resource_sharing_filter for more details.
You could continue and also unroll the loop at line 10, but unrolling this loop would result in the area increasing again. By allowing the compiler to pipeline this loop instead of unrolling it, you can avoid increasing the area and pay only about four more clock cycles, assuming that the i-loop has an II of 1. If the II is not 1, the Details pane of the Loops Analysis page in the high-level design report (report.html) gives you tips on how to improve it. Common reasons for a loop II greater than 1 include the following:

- loop-carried dependencies (see the tutorial at <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency)
- long critical loop path
- inner loops with a loop II > 1
5.3. Construct Well-Formed Loops
A well-formed loop has an exit condition that compares against an integer bound and has a simple induction increment of one per iteration. The Intel® HLS Compiler Pro Edition can analyze well-formed loops efficiently, which can help improve the performance of your component.
for (int i = 0; i < N; i++) {
  //statements
}
Well-formed nested loops can also help maximize the performance of your component.
for (int i = 0; i < N; i++) {
  //statements
  for (int j = 0; j < M; j++) {
    //statements
  }
}
5.4. Minimize Loop-Carried Dependencies
The loop structure that follows has a loop-carried dependency because each loop iteration reads data written by the previous iteration. As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies reduces the pipeline parallelism that the Intel® HLS Compiler Pro Edition can achieve, which reduces component performance.
for (int i = 1; i < N; i++) {
  A[i] = A[i - 1] + i;
}
The Intel® HLS Compiler Pro Edition performs a static memory dependency analysis on loops to determine the extent of parallelism that it can achieve. If the Intel® HLS Compiler Pro Edition cannot determine that there are no loop-carried dependencies, it assumes that loop-carried dependencies exist. The compiler's ability to test for loop-carried dependencies is impeded by unknown variables at compilation time or by array accesses in your code that involve complex addressing.
To avoid unnecessary loop-carried dependencies and help the compiler to better analyze your loops, follow these guidelines:
Avoid Pointer Arithmetic
Compiler output is suboptimal when your component accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array as follows:
for (int i = 0; i < N; i++) {
  int t = *(A++);
  *A = t;
}
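A minimal sketch of the same operation with direct indexing (assuming A points to an array with at least N+1 elements), which the compiler can analyze more easily:

for (int i = 0; i < N; i++) {
  int t = A[i];   // read with a simple, analyzable index
  A[i + 1] = t;   // write with a simple, analyzable index
}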
Introduce Simple Array Indexes
Some types of complex array indexes cannot be analyzed effectively, which might lead to suboptimal compiler output. Avoid the following constructs as much as possible:

- Nonconstants in array indexes.
For example, A[K + i], where i is the loop index variable and K is an unknown variable.
- Multiple index variables in the same subscript location.
For example, A[i + 2 × j], where i and j are loop index variables for a double nested loop.
The array index A[i][j] can be analyzed effectively because the index variables are in different subscripts.
- Nonlinear indexing.
For example, A[i & C], where i is a loop index variable and C is a nonconstant variable.
Use Loops with Constant Bounds Whenever Possible
The compiler can perform range analysis effectively when loops have constant bounds.
You can place an if-statement inside your loop to control in which iterations the loop body executes.
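A minimal sketch of this pattern (assuming a compile-time bound MAX_N and a runtime length n that never exceeds MAX_N):

#define MAX_N 1024
for (int i = 0; i < MAX_N; i++) {  // constant bound enables range analysis
  if (i < n) {                     // run the body only for the first n iterations
    //statements
  }
}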
Ignore Memory Dependencies
If there are no implicit memory dependencies across loop iterations, you can use the ivdep pragma to tell the Intel® HLS Compiler Pro Edition to ignore possible memory dependencies.
For details about how to use the ivdep pragma, see Loop-Carried Dependencies (ivdep Pragma) in the Intel® High Level Synthesis Compiler Pro Edition Reference Manual.
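For example, a minimal sketch in which the designer knows that the accesses to A never carry a dependency between iterations:

// Tell the compiler to ignore possible memory dependencies in this loop.
#pragma ivdep
for (int i = 0; i < N; i++) {
  A[i] = B[i];
}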
5.5. Avoid Complex Loop-Exit Conditions
If a loop in your component has complex exit conditions, memory accesses or complex operations might be required to evaluate the condition. Subsequent iterations of the loop cannot launch in the loop pipeline until the evaluation completes, which can decrease the overall performance of the loop.
Use the speculated_iterations pragma to specify how many cycles the loop exit condition can take to compute.
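A minimal sketch of the pragma (the count of 4 is an assumed value that you should tune for your design):

// Allow the exit-condition computation to take up to 4 cycles
// without delaying the launch of later iterations.
#pragma speculated_iterations 4
while (m * m * m < N) {  // complex exit condition
  m += 1;
}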
5.6. Convert Nested Loops into a Single Loop
To maximize performance, combine nested loops into a single loop whenever possible. The control flow required for each loop adds overhead in both logic utilization and FPGA hardware footprint. Combining nested loops into a single loop reduces this overhead and improves the performance of your component.
The following code examples illustrate the conversion of a nested loop into a single loop:
Nested Loop | Converted Single Loop |
---|---|
for (i = 0; i < N; i++) { //statements for (j = 0; j < M; j++) { //statements } //statements } | for (i = 0; i < N*M; i++) { //statements } |
You can also specify the loop_coalesce pragma to coalesce nested loops into a single loop without affecting the loop functionality. The following simple example shows how the compiler coalesces two loops into a single loop when you specify the loop_coalesce pragma.
#pragma loop_coalesce
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++)
    sum[i][j] += i + j;
The compiler transforms these loops as if they were written as follows:

int i = 0;
int j = 0;
while (i < N) {
  sum[i][j] += i + j;
  j++;
  if (j == M) {
    j = 0;
    i++;
  }
}
For more information about the loop_coalesce pragma, see "Loop Coalescing (loop_coalesce Pragma)" in Intel® High Level Synthesis Compiler Pro Edition Reference Manual.
You can also review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/loop_coalesce
5.7. Place if-Statements in the Lowest Possible Scope in a Loop Nest
If you have a nest of loops, avoid placing loops inside conditional statements.
These conditions can cause the outer loop to take different paths (divergent loops), which can reduce the quality of results (QoR) of your component because the conditions prevent the Intel® HLS Compiler from pipelining the loops.
for (int row = 0; row < outerTripCount; row++) {
  if (loopCondition) {
    for (int col = 0; col < innerTripCount; col++) {
      foo();
    }
  } else {
    for (int col = 0; col < innerTripCount; col++) {
      bar();
    }
  }
}
Moving the if-statement into the innermost loop removes the divergence:

for (int row = 0; row < outerTripCount; row++) {
  for (int col = 0; col < innerTripCount; col++) {
    if (loopCondition) {
      foo();
    } else {
      bar();
    }
  }
}
You can also review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/divergent_loops
5.8. Declare Variables in the Deepest Scope Possible
To reduce the FPGA hardware resources necessary for implementing a variable, declare the variable just before you use it in a loop. Declaring variables in the deepest scope possible minimizes data dependencies and FPGA hardware usage because the Intel® HLS Compiler Pro Edition does not need to preserve the variable data across loops that do not use the variables.
Consider the following example:
int a[N];
for (int i = 0; i < m; ++i) {
  int b[N];
  for (int j = 0; j < n; ++j) {
    // statements
  }
}
The array a requires more resources to implement than the array b. To reduce hardware usage, declare array a in the deepest scope where it is used, unless you must preserve its data across iterations of the outer loop.
5.9. Raise Loop II to Increase fMAX
If you have a loop that does not affect the throughput of your component, you can raise the initiation interval (II) of the loop with the ii pragma to try to increase the fMAX of your design.
Example
Consider a case where your component has two distinct sequential pipelineable loops: an initialization loop with a low trip count and a processing loop with a high trip count and no loop-carried memory dependencies. In this case, the compiler does not know that the initialization loop has a much smaller impact on the overall throughput of your design. If possible, the compiler attempts to pipeline both loops with an II of 1.
Because the initialization loop has a loop-carried dependency, it has a feedback path in the generated hardware. To achieve an II of 1 with such a feedback path, some clock frequency might be sacrificed. If the initialization loop did not constrain the clock, the rest of your design could run at a higher operating frequency.
If you specify #pragma ii 2 on the initialization loop, you tell the compiler that it can be less aggressive in optimizing the II of that loop. Less aggressive optimization lets the compiler pipeline the path that limits the fMAX, which could allow your overall component design to achieve a higher fMAX.

The initialization loop takes longer to run with its new II, but the decrease in the running time of the long-running processing loop due to the higher fMAX compensates for the increased running time of the initialization loop.
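A minimal sketch of the technique (the loop bound SMALL_N, the history array, and the II of 2 are assumed values for illustration):

// Relax the II of a short initialization loop so that its feedback
// path does not limit the fMAX of the whole component.
#pragma ii 2
for (int i = 1; i < SMALL_N; i++) {
  history[i] = history[i - 1] * factor;  // loop-carried dependency
}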
5.10. Control Loop Interleaving
The initiation interval (II) of a loop is the statically determined number of cycles between successive iteration launches of a given loop invocation. However, the statically scheduled II may differ from the realized dynamic II when considering interleaving.

Interleaving allows the iterations of more than one invocation of a loop to execute in parallel, provided that the static II of that loop is greater than 1. By default, the maximum amount of interleaving for a loop is equal to the static II of that loop.

With interleaving, the dynamic II of a loop can be approximated by the static II of the loop divided by the degree of interleaving, that is, by the number of concurrent invocations of the loop that are in flight.
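A minimal sketch of restricting interleaving (assuming a loop whose static II is greater than 1 and whose throughput is not critical), which can save the area needed to track concurrent invocations:

// Limit the loop to a single in-flight invocation.
#pragma max_interleaving 1
for (int i = 0; i < N; i++) {
  acc += expensive_op(data[i]);  // assumed II > 1 due to the accumulation
}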
Review the following tutorial to learn more about loop interleaving and how to control it: <quartus_installdir>/hls/examples/tutorials/loop_controls/ max_interleaving.
6. fMAX Bottleneck Best Practices
Tutorials Demonstrating fMAX Bottleneck Best Practices
The Intel® HLS Compiler Pro Edition comes with a number of tutorials that illustrate important Intel® HLS Compiler concepts and demonstrate good coding practices.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
best_practices/fpga_reg | Demonstrates how manually adding pipeline registers can increase fMAX. |
best_practices/overview | Demonstrates how fMAX can depend on the interface used in your component. |
best_practices/parallelize_array_operation | Demonstrates how to improve fMAX by correcting a bottleneck that arises when performing operations on an array in a loop. |
best_practices/reduce_exit_fifo_width | Demonstrates how to improve fMAX by reducing the width of the FIFO belonging to the exit node of a stall-free cluster. |
best_practices/relax_reduction_dependency | Demonstrates how fMAX can depend on the loop-carried feedback path. |
6.1. Balancing Target fMAX and Target II
The compiler optimizes the component for different scheduling objectives depending on whether you set a target fMAX and whether you set the ii pragma on each of the loops.
The fMAX target is a strong suggestion: the compiler does not error out if it cannot achieve this fMAX. In contrast, the ii pragma triggers an error if the compiler cannot achieve the requested II. The fMAX achieved for each block of code is shown in the Loops report.
The following table outlines the behavior of the scheduler in the Intel® HLS Compiler:
Explicitly Specify fMAX? | Explicitly Specify II? | Compiler Behavior |
---|---|---|
No | No | Use heuristic to achieve best fMAX/II trade-off. |
No | Yes | Best effort to achieve the II for the corresponding loop (may not achieve the best possible fMAX). |
Yes | No | Best effort to achieve the fMAX specified (may not achieve the best possible II). |
Yes | Yes | Best effort to achieve the fMAX specified at the given II. The compiler errors out if it cannot achieve the requested II. |
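A minimal sketch showing both controls together (the 300 MHz target, the component name, and the II of 1 are assumed values for illustration):

// Ask the scheduler to target 300 MHz for this component...
hls_scheduler_target_fmax_mhz(300)
component void my_kernel(int *data, int n) {
  // ...and require an II of 1 on this loop (the compiler errors out
  // if it cannot achieve the requested II).
  #pragma ii 1
  for (int i = 0; i < n; i++) {
    data[i] += 1;
  }
}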
7. Memory Architecture Best Practices
In most cases, you can optimize the memory architecture by modifying the access pattern. However, the Intel® HLS Compiler Pro Edition gives you some control over the memory architecture.
Tutorials Demonstrating Memory Architecture Best Practices
The Intel® HLS Compiler Pro Edition comes with a number of tutorials that illustrate important Intel® HLS Compiler concepts and demonstrate good coding practices.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials/component_memories

Tutorial | Description |
---|---|
attributes_on_mm_slave_arg | Demonstrates how to apply memory attributes to Avalon® Memory Mapped (MM) slave arguments. |
exceptions | Demonstrates how to use memory attributes on constants and struct members. |
memory_bank_configuration | Demonstrates how to control the number of load/store ports of each memory bank and optimize your component area usage, throughput, or both by using memory attributes. |
memory_geometry | Demonstrates how to control the number of load/store ports of each memory bank and optimize your component area usage, throughput, or both by using memory attributes. |
memory_implementation | Demonstrates how to implement variables or arrays in registers, MLABs, or RAMs by using memory attributes. |
memory_merging | Demonstrates how to improve resource utilization by implementing two logical memories as a single physical memory by merging them depth-wise or width-wise with the hls_merge memory attribute. |
non_trivial_initialization | Demonstrates how to use the C++ keyword constexpr to achieve efficient initialization of read-only variables. |
non_power_of_two_memory | Demonstrates how to use the force_pow2_depth memory attribute to control the padding of memories that are non-power-of-two deep, and how that impacts the FPGA memory resource usage. |
static_var_init | Demonstrates how to control the initialization behavior of statics in a component using the hls_init_on_reset or hls_init_on_powerup memory attribute. |
7.1. Example: Overriding a Coalesced Memory Architecture
Using memory attributes in various combinations in your code allows you to override the memory architecture that the Intel® HLS Compiler Pro Edition infers for your component.
The following code examples demonstrate how you can use the following memory attributes to override coalesced memory to conserve memory blocks on your FPGA:
- hls_bankwidth(N)
- hls_numbanks(N)
- hls_singlepump
- hls_max_replicates(N)
The original code coalesces two memory accesses, resulting in a memory system that is 256 locations deep by 64 bits wide (256x64 bits) (two on-chip memory blocks):
component unsigned int mem_coalesce_default(unsigned int raddr,
                                            unsigned int waddr,
                                            unsigned int wdata) {
  unsigned int data[512];

  data[2*waddr]     = wdata;
  data[2*waddr + 1] = wdata + 1;

  unsigned int rdata = data[2*raddr] + data[2*raddr + 1];
  return rdata;
}
The following images show how the 256x64 bit memory for this code sample is structured, as well as how the component memory structure is shown in the high-level design report (report.html).
The modified code implements a single on-chip memory block that is 512 words deep by 32 bits wide with stallable arbitration:
component unsigned int mem_coalesce_override(unsigned int raddr,
                                             unsigned int waddr,
                                             unsigned int wdata) {
  // Attributes that stop memory coalescing
  hls_bankwidth(4) hls_numbanks(1)
  // Attributes that specify a single-pumped single-replicate memory
  hls_singlepump hls_max_replicates(1)
  unsigned int data[512];

  data[2*waddr]     = wdata;
  data[2*waddr + 1] = wdata + 1;

  unsigned int rdata = data[2*raddr] + data[2*raddr + 1];
  return rdata;
}
The following images show how the 512x32 bit memory with stallable arbitration for this code sample is structured, as well as how the component memory structure is shown in the high-level design report (report.html).
While it might appear that you save hardware area by reducing the number of RAM blocks needed for the component, the introduction of stallable arbitration increases the number of ALMs and FFs needed to implement the component.
7.2. Example: Overriding a Banked Memory Architecture
Using memory attributes in various combinations in your code allows you to override the memory architecture that the Intel® HLS Compiler Pro Edition infers for your component.
The following code examples demonstrate how you can use the following memory attributes to override banked memory to conserve memory blocks on your FPGA:
- hls_bankwidth(N)
- hls_numbanks(N)
- hls_singlepump
- hls_doublepump
The original code creates two banks of single-pumped on-chip memory blocks that are 16 bits wide:
component unsigned short mem_banked(unsigned short raddr,
                                    unsigned short waddr,
                                    unsigned short wdata) {
  unsigned short data[1024];

  data[2*waddr]     = wdata;
  data[2*waddr + 9] = wdata + 1;

  unsigned short rdata = data[2*raddr] + data[2*raddr + 9];
  return rdata;
}
To save banked memory, you can implement one bank of double-pumped 32-bit-wide on-chip memory block by adding the following attributes before the declaration of data[1024]. These attributes fold the two half-used memory banks into one fully-used memory bank that is double pumped, so that it can be accessed as quickly as the two half-used memory banks.
hls_bankwidth(2) hls_numbanks(1)
hls_doublepump
unsigned short data[1024];
Alternatively, you can avoid the double-clock requirement of the double-pumped memory by implementing one bank of single-pumped on-chip memory block by adding the following attributes before the declaration of data[1024]. However, in this example, these attributes add stallable arbitration to your component memories, which hurts your component performance.
hls_bankwidth(2) hls_numbanks(1)
hls_singlepump
unsigned short data[1024];
7.3. Merge Memories to Reduce Area
In some cases, you can save FPGA memory blocks by merging your component memories so that they consume fewer memory blocks, reducing the FPGA area your component uses. Use the hls_merge attribute to force the Intel® HLS Compiler Pro Edition to implement different variables in the same memory system.
When you merge memories, multiple component variables share the same memory block. You can merge memories by width (width-wise merge) or depth (depth-wise merge). You can merge memories where the data in the memories have different datatypes.
For example, four individual memories can be merged into a single physical memory either width-wise or depth-wise.
7.3.1. Example: Merging Memories Depth-Wise
Use the hls_merge("<mem_name>","depth") attribute to force the Intel® HLS Compiler Pro Edition to implement variables in the same memory system, merging their memories by depth.
All variables with the same <mem_name> label set in their hls_merge attributes are merged.
Consider the following component code:
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
  int a[128];
  int b[128];

  int rdata;

  // mutually exclusive write
  if (use_a) {
    a[waddr] = wdata;
  } else {
    b[waddr] = wdata;
  }

  // mutually exclusive read
  if (use_a) {
    rdata = a[raddr];
  } else {
    rdata = b[raddr];
  }
  return rdata;
}
The code instructs the Intel® HLS Compiler Pro Edition to implement local memories a and b as two on-chip memory blocks, each with its own load and store instructions.
Because the load and store instructions for local memories a and b are mutually exclusive, you can merge the accesses, as shown in the example code below. Merging the memory accesses reduces the number of load and store instructions, and the number of on-chip memory blocks, by half.
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
  int a[128] hls_merge("mem","depth");
  int b[128] hls_merge("mem","depth");

  int rdata;

  // mutually exclusive write
  if (use_a) {
    a[waddr] = wdata;
  } else {
    b[waddr] = wdata;
  }

  // mutually exclusive read
  if (use_a) {
    rdata = a[raddr];
  } else {
    rdata = b[raddr];
  }
  return rdata;
}
There are cases where merging local memories with respect to depth might degrade memory access efficiency. Before you decide whether to merge the local memories with respect to depth, refer to the HLD report (<result>.prj/reports/report.html) to ensure that the merge produces the expected memory configuration with the expected number of load and store instructions. In the example below, the Intel® HLS Compiler Pro Edition should not merge the accesses to local memories a and b because the load and store instructions to each memory are not mutually exclusive.
component int depth_manual(bool use_a, int raddr, int waddr, int wdata) {
  int a[128] hls_merge("mem","depth");
  int b[128] hls_merge("mem","depth");

  int rdata;

  // NOT mutually exclusive write
  a[waddr] = wdata;
  b[waddr] = wdata;

  // NOT mutually exclusive read
  rdata  = a[raddr];
  rdata += b[raddr];

  return rdata;
}
In this case, the Intel® HLS Compiler Pro Edition might double pump the memory system to provide enough ports for all the accesses. Otherwise, the accesses must share ports, which prevents stall-free accesses.
7.3.2. Example: Merging Memories Width-Wise
Use the hls_merge("<mem_name>","width") attribute to force the Intel® HLS Compiler Pro Edition to implement variables in the same memory system, merging their memories by width.
All variables with the same <mem_name> label set in their hls_merge attributes are merged.
Consider the following component code:
component short width_manual(int raddr, int waddr, short wdata) {
  short a[256];
  short b[256];

  short rdata = 0;

  // Lock step write
  a[waddr] = wdata;
  b[waddr] = wdata;

  // Lock step read
  rdata += a[raddr];
  rdata += b[raddr];

  return rdata;
}
In this case, the Intel® HLS Compiler Pro Edition can coalesce the load and store instructions to local memories a and b because their accesses are to the same address, as shown below.
component short width_manual(int raddr, int waddr, short wdata) {
  short a[256] hls_merge("mem","width");
  short b[256] hls_merge("mem","width");

  short rdata = 0;

  // Lock step write
  a[waddr] = wdata;
  b[waddr] = wdata;

  // Lock step read
  rdata += a[raddr];
  rdata += b[raddr];

  return rdata;
}
7.4. Example: Specifying Bank-Selection Bits for Local Memory Addresses
The (b0, b1, ..., bn) arguments refer to the local memory address bit positions that the Intel® HLS Compiler Pro Edition should use for the bank-selection bits. Specifying the hls_bankbits(b0, b1, ..., bn) attribute implies that the number of banks equals 2^(number of bank bits).
 | Bank 0 | Bank 1 | Bank 2 | Bank 3 |
---|---|---|---|---|
Word 0 | 00 000 | 01 000 | 10 000 | 11 000 |
Word 1 | 00 001 | 01 001 | 10 001 | 11 001 |
Word 2 | 00 010 | 01 010 | 10 010 | 11 010 |
Word 3 | 00 011 | 01 011 | 10 011 | 11 011 |
Word 4 | 00 100 | 01 100 | 10 100 | 11 100 |
Word 5 | 00 101 | 01 101 | 10 101 | 11 101 |
Word 6 | 00 110 | 01 110 | 10 110 | 11 110 |
Word 7 | 00 111 | 01 111 | 10 111 | 11 111 |

In each entry, the first two bits are the bank-selection bits and the remaining three bits are the word address within the bank.
Example of Implementing the hls_bankbits Attribute
component int bank_arbitration(int raddr, int waddr, int wdata) {

#define DIM_SIZE 4

  // Adjust memory geometry by preventing coalescing
  hls_numbanks(1)
  hls_bankwidth(sizeof(int)*DIM_SIZE)

  // Force each memory bank to have 2 ports for read/write
  hls_singlepump
  hls_max_replicates(1)

  int a[DIM_SIZE][DIM_SIZE][DIM_SIZE];

  // initialize array a…

  int result = 0;

  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      a[dim1][waddr&(DIM_SIZE-1)][dim3] = wdata;

  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      result += a[dim1][raddr&(DIM_SIZE-1)][dim3];

  return result;
}
As illustrated in the following figure, this code example generates multiple load and store instructions, and therefore multiple load/store units (LSUs) in the hardware. If the memory system is not split into multiple banks, there are fewer ports than memory access instructions, leading to arbitrated accesses. This arbitration results in a high loop initiation interval (II) value. Avoid arbitration whenever possible because it increases the FPGA area utilization of your component and impairs the performance of your component.

By default, the Intel® HLS Compiler Pro Edition splits the memory into banks if it determines that the split is beneficial to the performance of your component. The compiler checks if any bits remain constant between accesses, and automatically infers bank-selection bits.
component int bank_no_arbitration(int raddr, int waddr, int wdata) {

#define DIM_SIZE 4

  // Adjust memory geometry by preventing coalescing and splitting memory
  hls_bankbits(4, 5)
  hls_bankwidth(sizeof(int)*DIM_SIZE)

  // Force each memory bank to have 2 ports for read/write
  hls_singlepump
  hls_max_replicates(1)

  int a[DIM_SIZE][DIM_SIZE][DIM_SIZE];

  // initialize array a…

  int result = 0;

  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      a[dim1][waddr&(DIM_SIZE-1)][dim3] = wdata;

  #pragma unroll
  for (int dim1 = 0; dim1 < DIM_SIZE; dim1++)
    #pragma unroll
    for (int dim3 = 0; dim3 < DIM_SIZE; dim3++)
      result += a[dim1][raddr&(DIM_SIZE-1)][dim3];

  return result;
}
The following diagram shows that this example code creates a memory configuration with four banks. Using bits 4 and 5 as bank selection bits ensures that each load/store access is directed to its own memory bank.

In this code example, setting hls_numbanks(4) instead of hls_bankbits(4,5) results in the same memory configuration because the Intel® HLS Compiler Pro Edition automatically infers the optimal bank selection bits.
In the Function Memory Viewer (in the High-Level Design Reports), the Address bit information shows the bank selection bits as b6 and b7, instead of b4 and b5.
This difference occurs because the address bits reported in the Function Memory Viewer are based on byte addresses, not element addresses. Because every element in array a is four bytes in size, bits b4 and b5 of the element address correspond to bits b6 and b7 of the byte address.
8. System of Tasks Best Practices
After you implement systems of tasks, you might want to balance the capacity of your task functions. For details, review the advice in Balancing Capacity in a System of Tasks.
8.1. Executing Multiple Loops in Parallel
Consider a component that executes two sequential loops:

component void foo() {
  // first loop
  for (int i = 0; i < n; i++) {
    // Do something
  }
  // second loop
  for (int i = 0; i < m; i++) {
    // Do something else
  }
}

By moving each loop into its own task function and launching both tasks, the two loops can run in parallel:
void first_loop() {
for (int i = 0; i < n; i++) {
// Do something
}
}
void second_loop() {
for (int i = 0; i < m; i++) {
// Do something else
}
}
component void foo() {
ihc::launch<first_loop>();
ihc::launch<second_loop>();
ihc::collect<first_loop>();
ihc::collect<second_loop>();
}
Review the tutorial <quartus_installdir>/hls/examples/tutorials/system_of_tasks/parallel_loop to learn more about how to run multiple loops in parallel.
8.2. Sharing an Expensive Compute Block
To allow calls to a task from multiple places, the Intel® HLS Compiler Pro Edition adds arbitration logic to the called task function. This arbitration logic can increase the area utilization of the component. However, if the shared logic is large, the trade-off can save FPGA resources. The savings are especially noticeable when your component has a large compute block that is not always active.
Review the tutorial <quartus_installdir>/hls/examples/tutorials/system_of_tasks/resource_sharing to see a simple example of how to share a compute block in a component.
8.3. Implementing a Hierarchical Design
If you do not use a system of tasks, function calls in your HLS component are in-lined and optimized together with the calling code, which can be detrimental in some situations. Use a system of tasks to prevent smaller blocks of your design from being affected by the rest of the system. A system of tasks gives you the following benefits:

- Modularity similar to what a hardware description language (HDL) might provide
- Unpipelineable or poorly pipelined loops can be isolated so that they do not affect an entire loop nest
8.4. Balancing Capacity in a System of Tasks
Typically, these performance issues are caused by a lack of capacity in the datapath of the functions that call task functions using the ihc::launch and ihc::collect calls. You can improve system throughput in these cases by adding a buffer to the explicit streams to account for the latency of the task functions.
- <quartus_installdir>/hls/examples/tutorials/system_of_tasks/balancing_pipeline_latency
- <quartus_installdir>/hls/examples/tutorials/system_of_tasks/balancing_loop_delay
The Intel® HLS Compiler Pro Edition emulator models the size of the buffer attached to a stream. However, the emulator does not fully account for hardware latencies, so your design might exhibit different behavior in simulation than in emulation in these cases.
In addition to the techniques outlined in the tutorials, follow these practices to maximize the data throughput of your design.
8.4.1. Enable the Intel HLS Compiler to Infer Data Path Buffer Capacity Requirements
In many situations, the Intel® HLS Compiler can add buffer capacity automatically to the data path in a system of tasks design to achieve maximum throughput for your design. Follow a few best practices to help the Intel® HLS Compiler effectively add data path buffer capacity to your design when needed.
component void foo() {
  // Parse/compute data for tasks
  ihc::launch<task1>(data1);
  ihc::launch<task2>(data2);

  auto r1 = ihc::collect<task1>();
  auto r2 = ihc::collect<task2>();
  // Usage of r1, r2
}
Here, Entry represents the two independent launch calls, and Exit represents the two independent collect calls.
Entry provides work to both tasks only if both tasks can accept data (that is, both tasks have available buffer capacity). Similarly, Exit consumes the results only when both results are available.
If Task1 and Task2 have the same number of pipeline stages, then the data path performs at full throughput. Some data path buffer capacity is needed in the caller function to ensure that the caller can continue issuing launch calls while the collect calls wait for the task functions to complete. The compiler adds this data path buffer capacity automatically.
If the two tasks have different pipeline depths, then the design encounters a bottleneck because the task with the smaller pipeline depth lacks the buffer capacity to store finished results while waiting for the other task to finish. In this case, you can add buffer capacity to either the ihc::launch or the ihc::collect call of the task with the smaller pipeline depth. For details about adding launch/collect buffer capacity, see Explicitly Add Buffer Capacity to Your Design When Needed.
The Intel® HLS Compiler tries to balance data path buffer capacity automatically, but it can only add data path capacity automatically when your design follows certain practices.
- A component or task function should do one of the following things:
  - Do all of the work by itself without launching other tasks.
  - Act as an orchestrator for issuing ihc::launch and ihc::collect calls and do none of the work itself.
- If throughput is a priority for your design, avoid using multiple ihc::launch or ihc::collect calls to the same task function unless you are reusing the calls to the function by iterating in a loop.
- Keep ihc::launch and ihc::collect calls to the same task function within the same block. Review the block structure of your design with the Graph Viewer in the High-Level Design Reports to confirm that your calls are in the same block.
- Avoid guarding your ihc::launch and ihc::collect calls with an if-condition. If you must guard the calls, use the same if-condition for both the ihc::launch and ihc::collect calls.
8.4.2. Explicitly Add Buffer Capacity to Your Design When Needed
When the Intel® HLS Compiler cannot infer the optimal capacity requirements, you can explicitly add buffer capacity to your design by specifying a value for the capacity parameter of the ihc::launch and ihc::collect functions.
Adding Capacity When Launching Task Functions
The amount of buffer capacity to add to an ihc::launch call depends on the following factors:

- Any back-pressure introduced by the task function
- How often the caller launches the task function
Setting the capacity parameter in an ihc::launch call inserts a buffer that allows the caller to queue up work for the task function, which is then free to pull work off that queue when it can.
Adding Capacity When Collecting Task Functions
The amount of buffer capacity to add to an ihc::collect call depends on the following factors:

- The cadence at which the task function produces data
- The cadence at which the caller function reads that data
Setting the capacity parameter in an ihc::collect call inserts a buffer that can hold the return values as they are computed by the task function. The caller function is free to pull the return values from the buffer at a convenient later time without causing backpressure on the task function.
9. Datatype Best Practices
After you optimize the algorithm bottlenecks of your design, you can fine-tune some datatypes in your component by using arbitrary precision datatypes to shrink data widths, which reduces FPGA area utilization. The Intel® HLS Compiler Pro Edition provides debug functionality so that you can easily detect overflows in arbitrary precision datatypes.
Because C++ automatically promotes smaller datatypes such as short or char to 32 bits for operations such as addition or bit-shifting, you must use the arbitrary precision datatypes if you want to create narrow datapaths in your component.
Tutorials Demonstrating Datatype Best Practices
The Intel® HLS Compiler Pro Edition comes with a number of tutorials that illustrate important Intel® HLS Compiler concepts and demonstrate good coding practices.
You can find these tutorials in the following location on your Intel® Quartus® Prime system: <quartus_installdir>/hls/examples/tutorials

Tutorial | Description |
---|---|
best_practices/ac_datatypes | Demonstrates the effect of using the ac_int datatype instead of the int datatype. |
ac_datatypes/ac_fixed_constructor | Demonstrates the use of the ac_fixed constructor, where you can get a better QoR by using minor variations in coding style. |
ac_datatypes/ac_int_basic_ops | Demonstrates the operators available for the ac_int class. |
ac_datatypes/ac_int_overflow | Demonstrates the usage of the DEBUG_AC_INT_WARNING and DEBUG_AC_INT_ERROR keywords to help detect overflow during emulation runtime. |
best_practices/single_vs_double_precision_math | Demonstrates the effect of using single-precision literals and functions instead of double-precision literals and functions. |
ac_datatypes/ac_fixed_math_library | Demonstrates the use of the ac_fixed math library functions. |
hls_float/1_reduced_double | Demonstrates how your applications can benefit from changing the underlying type from double to hls_float<11,44> (reduced double). |
hls_float/2_explicit_arithmetic | Demonstrates how to use explicit versions of hls_float binary operations to perform floating-point arithmetic operations based on your needs. |
hls_float/3_conversions | Demonstrates when conversions appear in designs that use the hls_float data type and how to take advantage of different conversion modes to generate compile-time constants using hls_float types. |
9.1. Avoid Implicit Data Type Conversions
Compile your component code with the -Wconversion compiler option, especially if your component uses floating-point variables. Using this option helps you avoid inadvertent conversions between double-precision and single-precision values when double-precision variables are not needed. In FPGAs, using double-precision variables can negatively affect the data transfer rate, the latency, and the resource utilization of your component.
Additionally, constants are treated as signed int or double by default. If you want efficient operations with narrower constants, cast constants to other, narrower data types, such as ac_int<> or float.
If you use the Algorithmic C (AC) arbitrary precision datatypes, pay attention to the type propagation rules.
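A minimal sketch (the function and variable names are illustrative) of keeping constants and operations narrow:

// Assumes: #include "HLS/ac_int.h" for the ac_int datatype.

float scale(float x) {
  // The 0.5f literal keeps the multiply in single precision;
  // writing 0.5 would promote the operation to double precision.
  return x * 0.5f;
}

// A narrow constant for a narrow datapath: ac_int<8, false> is an
// unsigned 8-bit integer, so the addition hardware stays narrow.
ac_int<8, false> increment(ac_int<8, false> v) {
  return v + ac_int<8, false>(1);
}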
9.2. Avoid Negative Bit Shifts When Using the ac_int Datatype
Bit shifting with the ac_int datatype differs from bit shifting in other languages, including C and Verilog. By default, if the shift amount is of a signed datatype, ac_int allows negative shifts.
In hardware, this negative shift results in the implementation of both a left shifter and a right shifter. The following code example shows a shift amount that is a signed datatype.
int14 shift_left(int14 a, int14 b) {
  return (a << b);
}
If you know that the shift is always in one direction, to implement an efficient shift operator, declare the shift amount as an unsigned datatype as follows:
int14 efficient_left_only_shift(int14 a, uint14 b) {
  return (a << b);
}
10. Advanced Troubleshooting
Review this section for help with the following issues:

- Your component behaves differently in simulation and emulation.
- Your component has unexpectedly poor performance, resource utilization, or both.
10.1. Component Fails Only In Simulation
Comparing Floating Point Results
Use an epsilon when comparing floating-point values in the testbench. Floating-point results from the RTL hardware can differ from results from the x86 emulation flow.
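A minimal sketch of an epsilon comparison for a testbench (the tolerance value is an assumption to tune for your datatype and algorithm):

#include <algorithm>
#include <cmath>

// Returns true when hw and ref agree within a relative tolerance.
bool nearly_equal(float hw, float ref, float rel_tol = 1e-5f) {
  float scale = std::max(std::fabs(hw), std::fabs(ref));
  return std::fabs(hw - ref) <= rel_tol * std::max(scale, 1.0f);
}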
Using #pragma ivdep to Ignore Memory Dependencies
The #pragma ivdep compiler pragma can cause functional incorrectness in your component if your component has a memory dependency that you attempted to ignore with the pragma. You can use the safelen modifier to specify how many memory accesses can be permitted before a memory dependency occurs.
See Loop-Carried Dependencies (ivdep Pragma) in Intel® High Level Synthesis Compiler Pro Edition Reference Manual for a description of this pragma.
To see an example of using the ivdep pragma, review the tutorial in <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency.
Check for Uninitialized Variables
Many coding practices can result in behavior that is undefined by the C++ specification. Sometimes this undefined behavior works one way in emulation and a different way in simulation.
A common example of this situation occurs when your design reads from uninitialized variables, especially uninitialized struct variables.
Check your code for uninitialized values with the -Wuninitialized compiler flag, or debug your emulation testbench with the valgrind debugging tool. The -Wuninitialized compiler flag does not show uninitialized struct variables.
You can also check for misbehaving variables by using one or more stream interfaces as debug streams. You can add one or more ihc::stream_out interfaces to your component to have the component write out its internal state variables as it executes. By comparing the output of the emulation flow and the simulation flow, you can see where the RTL behavior diverges from the emulator behavior.
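A minimal sketch of this debugging technique (the component signature and the internal computation are hypothetical):

// Debug stream added alongside the component's normal interfaces.
component void my_comp(ihc::stream_in<int>  &input,
                       ihc::stream_out<int> &debug_out) {
  int state = input.read() * 2;  // hypothetical internal computation
  debug_out.write(state);        // emit internal state so emulation and
                                 // simulation traces can be compared
}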
Non-blocking Stream Accesses
The emulation model of tryRead() is not cycle-accurate, so the behavior of tryRead() might differ between emulation and simulation.
If you have a non-blocking stream access (for example, tryRead()) from a stream with a FIFO (that is, the ihc::depth<> template parameter), then the first few iterations of tryRead() might return false in simulation, but return true in emulation.
In this case, invoke your component a few extra times from the testbench to guarantee that it consumes all data in the stream. These extra invocations should not cause functional problems because tryRead() returns false.
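A minimal sketch of a non-blocking read (the stream and consumer names are illustrative):

bool success = false;
// tryRead() returns a value and sets success to indicate whether
// the stream had data available.
int data = my_stream.tryRead(success);
if (success) {
  process(data);  // hypothetical consumer; runs only on valid data
}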
10.2. Component Gets Poor Quality of Results
The information in this section describes some common sources of stallable arbitration nodes or excess RAM utilization.
Component Uses More FPGA Resource Than Expected
By default, the Intel® HLS Compiler Pro Edition tries to optimize your component for the best throughput by trying to maximize the maximum operating frequency (fMAX).
A way to reduce area consumption is to relax the fMAX requirements by setting a target fMAX value with the --clock i++ command option or the hls_scheduler_target_fmax_mhz component attribute. The HLS compiler can often achieve a higher fMAX than you specify, so when you set a target fMAX to a lower value than you need, your design might still achieve an acceptable fMAX value while consuming less area.
To learn more about the behavior of fMAX target value control see the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/set_component_target_fmax
Incorrect Bank Bits
If you access parts of an array in parallel (either a single- or multidimensional array), you might need to configure the memory bank selection bits.
See Memory Architecture Best Practices for details about how to configure efficient memory systems.
Conditional Operator Accessing Two Different Arrays of struct Variables
In some cases, if you try to access different arrays of struct variables with a conditional operator, the Intel® HLS Compiler Pro Edition merges the arrays into the same RAM block. You might see stallable arbitration in the Function Memory Viewer because there are not enough load/store sites on the memory system.
struct MyStruct {
  float a;
  float b;
};

MyStruct array1[64];
MyStruct array2[64];
For example, the following conditional access can cause the compiler to merge array1 and array2 into the same RAM block:

MyStruct value = (shouldChooseArray1) ? array1[idx] : array2[idx];
To prevent the merge, rewrite the access as an if-statement:

MyStruct value;
if (shouldChooseArray1) {
  value = array1[idx];
} else {
  value = array2[idx];
}
Cluster Logic
Your design might consume more RAM blocks than you expect, especially if you store many array variables in large registers.
You can use the hls_use_stall_enable_clusters component attribute to prevent the compiler from inserting stall-free cluster exit FIFOs.
The Area Analysis of System report in the high-level design report (report.html) can help find this issue.

For example, in one design, three matrices are stored intentionally in RAM blocks, but the RAM blocks for the matrices account for less than half of the RAM blocks consumed by the component.
If you look further down the report, you might see that many RAM blocks are consumed by Cluster logic or State variables. You might also see that some of the array values that you intended to store in registers were instead stored in large numbers of RAM blocks.
To reduce this RAM block usage, try the following techniques:

- Pipeline loops instead of unrolling them.
- Store local variables in local RAM blocks (hls_memory memory attribute) instead of large registers (hls_register memory attribute). See the sketch after this list.
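A minimal sketch of the second technique (the array name and size are illustrative):

// Implement coeffs in an on-chip RAM block rather than a large register.
hls_memory int coeffs[512];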
A. Intel HLS Compiler Pro Edition Best Practices Guide Archives
B. Document Revision History for Intel HLS Compiler Pro Edition Best Practices Guide
Document Version | Intel® HLS Compiler Pro Edition Version | Changes |
---|---|---|
2020.12.14 | 20.4 | |
2020.09.28 | 20.3 | |
2020.06.22 | 20.2 | |
2020.04.13 | 20.1 | |
2020.01.27 | 19.4 | |
2019.12.16 | 19.4 | |
Document Revision History for Intel® HLS Compiler Best Practices Guide
Previous versions of the Intel® HLS Compiler Best Practices Guide contained information for both Intel® HLS Compiler Standard Edition and Intel® HLS Compiler Pro Edition.
Document Version | Intel® Quartus® Prime Version | Changes |
---|---|---|
2019.09.30 | 19.3 | |
2019.07.01 | 19.2 | |
2019.04.01 | 19.1 | |
2018.12.24 | 18.1 | |
2018.09.24 | 18.1 | |
2018.07.02 | 18.0 | |
2018.05.07 | 18.0 | |
2017.12.22 | 17.1.1 | |
2017.11.06 | 17.1 | Initial release. Parts of this book consist of content previously found in the Intel® High Level Synthesis Compiler User Guide and the Intel® High Level Synthesis Compiler Reference Manual. |