Intel High Level Synthesis Compiler Pro Edition: Reference Manual
Version Information
Updated for: Intel® Quartus® Prime Design Suite 20.4
1. Intel HLS Compiler Pro Edition Reference Manual
The Intel® HLS Compiler Pro Edition Reference Manual provides reference information about the features supported by the Intel® HLS Compiler Pro Edition. The Intel® HLS Compiler is sometimes referred to as the i++ compiler, reflecting the name of the compiler command.
In this publication, <quartus_installdir> refers to the location where you installed Intel® Quartus® Prime Design Suite. The default installation location depends on your operating system:
- Windows: C:\intelFPGA_pro\20.4
- Linux: /home/<username>/intelFPGA_pro/20.4
About the Intel® HLS Compiler Pro Edition Documentation Library
Title | Description |
---|---|
Release Notes | Provide late-breaking information about the Intel® HLS Compiler. |
Getting Started Guide | Get up and running with the Intel® HLS Compiler by learning how to initialize your compiler environment and reviewing the various design examples and tutorials provided with the Intel® HLS Compiler. |
User Guide | Provides instructions on synthesizing, verifying, and simulating intellectual property (IP) that you design for Intel FPGA products. Go through the entire development flow of your component from creating your component and testbench up to integrating your component IP into a larger system with the Intel Quartus Prime software. |
Reference Manual | Provides reference information about the features supported by the Intel HLS Compiler. Find details on Intel® HLS Compiler command options, header files, pragmas, attributes, macros, declarations, arguments, and template libraries. |
Best Practices Guide | Provides techniques and practices that you can apply to improve the FPGA area utilization and performance of your HLS component. Typically, you apply these best practices after you verify the functional correctness of your component. |
Quick Reference | Provides a brief summary of Intel HLS Compiler declarations and attributes on a single two-sided page. |
2. Compiler
2.1. Intel HLS Compiler Pro Edition Command Options
Command Option | Description |
---|---|
--debug-log | Instructs the compiler to generate a log file that contains diagnostic information. By default, the debug.log file is in the a.prj subdirectory within your current working directory. If you also include the -o <result> command option, the debug.log file is in the <result>.prj subdirectory. If your compilation fails, the debug.log file is generated whether you set this option or not. |
-h or --help | Instructs the compiler to list all the command options and their descriptions on screen. |
-o <result> | Instructs the compiler to place its output into the <result> executable and the <result>.prj directory. If you do not specify the -o <result> option, the compiler outputs an a.out file for Linux and an a.exe file for Windows. For example, the command i++ -o hlsoutput multiplier.c creates an hlsoutput executable for Linux and an hlsoutput.exe for Windows in your working directory. |
-v | Verbose mode that instructs the compiler to display messages describing the progress of the compilation. Example command: i++ -v multiplier.cpp, where multiplier.cpp is the input file. |
--version | Instructs the compiler to display its version information on screen. Command: i++ --version |
Option | Description |
---|---|
-c | Instructs the compiler to preprocess, parse, and generate object files (.o/.obj) in the current working directory. The linking stage is omitted. For example, the command i++ -march="Arria 10" -c multiplier.c creates a multiplier.o file and sets the name of the <result>.prj directory to multiplier.prj. When you later link the .o file, the -o option affects only the name of the executable file. The name of the <result>.prj directory remains unchanged from when the directory name was set by the i++ -c command invocation. |
--component <components> | Allows you to specify a comma-separated list of function names that you want the compiler to synthesize to RTL. Example command: i++ counter.cpp --component count To use this option, your component must be configured with C-linkage using the extern "C" specification. For example: extern "C" int myComponent(int a, int b) Using the component function attribute is preferred over using the --component command option to indicate functions that you want the compiler to synthesize to RTL. |
-D <macro>[=<val>] | Allows you to pass a macro definition (<macro>) and its value (<val>) to the compiler. If you do not specify a value for <val>, its default value is 1. |
-g | Generate debug information (default). |
-g0 | Do not generate debug information. |
--gcc-toolchain=<GCC_dir> | Specifies the path to a GCC installation that you want to use for compilation. This path should be the absolute path to the directory that contains the GCC lib, bin, and include folders. You should not need to use this option if you configured your system as described in the Getting Started Guide. |
--hyper-optimized-handshaking=[auto|off] | Use this option to modify the handshaking protocol used in certain areas of your design. By default, the --hyper-optimized-handshaking option is set to auto. When you enable this optimization, the compiler adds pipeline registers to the handshaking paths of the stallable nodes. This optimization results in a higher fMAX at the cost of increased area and latency due to the added registers. Disabling this optimization typically decreases area and latency at the cost of lower fMAX. Restriction: This option applies only to designs targeting Intel® Agilex™ and Intel® Stratix® 10 devices. If you use this option when you target other devices, the compiler exits with an error. |
-I <dir> | Adds a directory (<dir>) to the end of the include path list. |
-march=[x86-64 | <FPGA_family> | <FPGA_part_number>] | Instructs the compiler to compile the component to the specified architecture or FPGA family. The -march compiler option takes x86-64, an FPGA family name, or an FPGA part number as its value. If you do not specify this option, -march=x86-64 is assumed. If the parameter value that you specify contains spaces, surround the parameter value in quotation marks. |
--quartus-compile | Compiles your design with the Intel® Quartus® Prime compiler. Example command: i++ --quartus-compile <input_files> -march="Arria 10" When you specify this option, the Intel® Quartus® Prime compiler is run after the RTL is generated. The compiled Intel® Quartus® Prime project is put in the <result>.prj/quartus directory, and a summary of the FPGA resource consumption and maximum clock frequency is added to the high-level design reports in the <result>.prj/reports directory. This compilation is intended to estimate the best achievable fMAX for your component. Your component is not expected to cleanly close timing in the reports. |
--quartus-seed <seed> | Specifies the seed value that is used by the Intel® Quartus® Prime project located in the <result>.prj/quartus directory. This seed value is used by the Intel® Quartus® Prime Fitter for initial placement configuration when optimizing design placement to meet timing requirements (fMAX). |
--simulator <simulator_name> | Specifies the simulator you are using to perform verification. Values for <simulator_name> include modelsim and none. If you do not specify this option, --simulator modelsim is assumed. Important: The --simulator command option works only in conjunction with the -march command option. The --simulator none option instructs the HLS compiler to skip the verification flow and generate RTL for the components without generating the corresponding test bench. If you use this option, the high-level design report (report.html) is generated more quickly but you cannot simulate your design. Without data from simulation, the report must omit verification statistics such as component latency. Example command: i++ -march="<FPGA_family_or_part_number>" --simulator none multiplier.c |
-ffp-contract=fast | Removes intermediate rounding and conversion when possible, except for code blocks fenced by #pragma clang fp contract(off). To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops |
-ffp-reassoc | Relaxes the order of floating-point arithmetic operations, except for code blocks fenced by #pragma clang fp reassoc(off). To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops |
--daz | For double data types only, disables subnormal support in native IEEE-754 double-precision floating-point computations. To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/submnormal_and_rounding |
--rounding=[ieee | faithful] | For double data types only, controls the rounding scheme for native IEEE-754 double-precision adders, multipliers, and dividers. If you do not specify this option, adders and multipliers use IEEE-754 round to nearest, ties to even (RNE) rounding (0.5 ULP) and dividers use faithful rounding (1 ULP). The --rounding option takes one of the values ieee or faithful. To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/submnormal_and_rounding |
--clock <clock target> | Optimizes the RTL for the specified clock frequency or period. The clock target value must include a unit. For example: i++ -march="Arria 10" test.cpp --clock 100MHz or i++ -march="Arria 10" test.cpp --clock 10ns |
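The fencing described for -ffp-contract=fast in the table above might look like the following sketch. The function names are hypothetical, and the sketch assumes that the pragma is placed at the start of the compound statement that it fences, which is where Clang-based compilers accept it. Compilers that do not recognize the pragma are likely to ignore it.

```cpp
// Hypothetical helper: the multiply and add are eligible for contraction
// when the file is compiled with -ffp-contract=fast.
float mul_add_fast(float a, float b, float c) {
    return a * b + c;
}

// Hypothetical helper: the pragma fences this block, so the product keeps
// its intermediate rounding step even under -ffp-contract=fast.
float mul_add_strict(float a, float b, float c) {
#pragma clang fp contract(off)
    return a * b + c;
}
```

For inputs whose product and sum are exactly representable, both helpers return the same value. The two can differ only in the last bits of results that need rounding.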
Option | Description |
---|---|
-ghdl[=<depth>] | Logs signals when running the verification executable to help you debug the generated RTL. After running the executable, the simulator logs waveforms to the a.prj/verification/vsim.wlf file. Use the optional <depth> attribute to specify how many levels of hierarchy are logged. If you do not specify a value for the <depth> attribute, all signals are logged. Use -ghdl=1 to log only the top-level signals. For details about the ModelSim* waveform, see Debugging during Verification in Intel® High Level Synthesis Compiler Pro Edition User Guide. |
-L <dir> | (Linux only) Adds a directory (<dir>) to the end of the search path for the library files. |
-l <library> | (Linux only) Specifies the library file (.a) name when linking the object file to the binary. On Windows, you can list library files (.lib) on the command line without specifying any command options or flags. |
--x86-only | Creates only the testbench executable. The compiler outputs a <result> file for Linux or a <result>.exe file for Windows. The <result>.prj directory and its contents are not created. |
--fpga-only | Creates only the <result>.prj directory and its contents. The testbench executable file (<result> / <result>.exe) is not created. Before you can simulate your hardware from a compilation output that uses this option, you must compile your testbench with the --x86-only option (or as part of a full compilation). |
2.2. Using Libraries in Your Component
Use libraries to reuse functions created by you or others without needing to know the function implementation details.
To use the functions in a library, you must have the C/C++ header files (.h or .hpp) for the library available. For object libraries, you must also have the object library archive file (.a on Linux systems or .lib on Windows systems) available.
Any object libraries that you use in your component must have been built with the same version of the Intel FPGA high-level design tool that you use to compile your component. For example, to compile your component with the Intel HLS Compiler Version 20.4, the libraries included in your component must have been created with a version 20.4 Intel FPGA high-level design tool. If you use a library with a different version, you get a version mismatch error when you compile your component.
To include a library in your component:
- Review the header files corresponding to the library that you want to include in your component. The header file shows you the functions available to call in the library and how to call the functions.
- Include the header files in your component code. For example, #include "primitives.h"
- Compile your component with the Intel® HLS Compiler as follows:
  - For source code (that is, header-only) libraries, there is no additional library file name to specify. For example, i++ -march=arria10 MyComponent.cpp
  - For object libraries, ensure that you add the object library archive file name to the i++ command. For example, i++ -march=arria10 MyComponent.cpp libprim.a (Linux) or i++ -march=arria10 MyComponent.cpp libprim.lib (Windows).
2.3. Compiler Interoperability
 | GCC/MSVC | i++ |
---|---|---|
Testbench | X | X |
Component (emulation) | X | X |
Component (RTL) | | X |
To see what versions of GCC and Microsoft Visual Studio the Intel® HLS Compiler supports, see "Intel® High Level Synthesis Compiler Prerequisites" in Intel® High Level Synthesis Compiler Getting Started Guide.
The interoperability between GCC or Microsoft Visual Studio, and the Intel® HLS Compiler lets you decouple your testbench development from your component development. Decoupling your testbench development can be useful for situations where you want to iterate your testbench quickly with platform-native compilers (GCC/Microsoft Visual Studio), without having to recompile the RTL generated for your component.
To create only your testbench executable with the i++ command, specify the --x86-only option.
You can choose to only generate RTL and simulation support for your component by linking the object file or files for your component with the Intel® High Level Synthesis Compiler.
To generate only your RTL and simulation support for your component, specify the --fpga-only option.
To use a native compiler (GCC or Microsoft Visual Studio) to compile your Intel® HLS Compiler code, you must run your native compiler from a terminal session where you initialized the environment for the Intel® HLS Compiler. The initialization script chooses the correct native compiler for your system.
GCC
To compile your Intel® HLS Compiler code with GCC:
- Initialize your environment with the Intel® HLS Compiler initialization script: <quartus_installdir>/hls/init_hls.sh
- Add the path to the Intel® HLS Compiler header files to the g++ command include path. The header files are in the <quartus_installdir>/hls/include directory.
- Add the path to the HLS emulation library to the linker search path. The emulation library is in the <quartus_installdir>/hls/host/linux64/lib directory.
- Add the hls_emul library to the linker command (that is, specify -lhls_emul as a command option).
- Ensure that you specify the -std=c++17 option of the g++ command.
- If you want to generate debug symbols, specify the -g option of the g++ command.
- If you are using HLS tasks in a system of tasks (ihc::launch and ihc::collect), specify the -pthread option of the g++ command.
- If you are using arbitrary precision datatypes, include the reference version in your source code instead of the FPGA-optimized version provided with the Intel® HLS Compiler. You can use the __INTELFPGA_COMPILER__ macro to control which variant is included. For example, if you are using arbitrary precision integers, you can use the following macro code:

    #ifdef __INTELFPGA_COMPILER__
    #include "HLS/ac_int.h"
    #else
    #include "ref/ac_int.h"
    #endif

For example:

    g++ myFile.cpp -g -I"$(HLS_INSTALL_DIR)/include" -L"$(HLS_INSTALL_DIR)/host/linux64/lib" -lhls_emul -pthread -std=c++17
Microsoft Visual C++
The following instructions were tested with Microsoft Visual Studio 2017 Professional.
To compile your Intel® HLS Compiler code with Microsoft Visual C++:
- Initialize your environment with the Intel® HLS Compiler initialization script: <quartus_installdir>/hls/init_hls.bat
- Add the Intel® HLS Compiler header files to the compiler command include path. The header files are in the <quartus_installdir>\hls\include directory.
- Add the /Zi option to generate debug symbols when compiling.
- Add the /wd4068 option to suppress warnings because MSVC does not recognize the Intel® HLS Compiler pragmas.
- Add the HLS emulation library to the linker search path. The emulation library is in the <quartus_installdir>\hls\host\windows64\lib directory.
- Add the hls_emul library to the linker command.
- If you are using arbitrary precision datatypes, include the reference version instead of the FPGA-optimized version provided with the Intel® HLS Compiler. You can use the __INTELFPGA_COMPILER__ macro to control which version is included:

    #ifdef __INTELFPGA_COMPILER__
    #include "HLS/ac_int.h"
    #else
    #include "ref/ac_int.h"
    #endif

For example:

    cl myFile.cpp /I "%HLS_INSTALL_DIR%\include" /nologo /EHsc /wd4068 /MD /std:c++17 /Zi /link "/libpath:%HLS_INSTALL_DIR%\host\windows64\lib" hls_emul.lib
2.4. Intel HLS Compiler Hardware Model
The Intel® HLS Compiler attempts to pipeline functions as much as possible. Different stages of the pipeline might have multiple operations performed in parallel.
The following figure shows an example of the pipeline architecture generated by the Intel® HLS Compiler. The numbered operations on the right side represent the pipeline implementation of the C++ code on the left side of the figure. Each box in the right side of the figure is an operation in the pipeline.
With a pipelined approach, multiple invocations of the component can be simultaneously active. For example, the earlier figure shows that the first invocation of the component can be returning a result at the same time the fourth invocation of the component is called.
One invocation of a component advances to its next stage in the pipeline only after all of the operations of its current stage are complete.
Some operations can stall the pipeline. A common example of operations that can stall a pipeline is a variable latency operation like a memory load or store operation. To support pipeline stalls, the Intel® HLS Compiler propagates ready and valid signals through the pipeline to all operations that have a variable latency.
For operations that have a fixed latency, the Intel® HLS Compiler can statically schedule the interaction between the operations and ready signals are not needed between the stages with fixed latency operations. In these cases, the compiler optimizes the pipeline to statically schedule the operations, which significantly reduces the logic required to implement the pipeline.
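The overlap of invocations in a pipeline can be illustrated with a small software model. This is only a sketch under idealized assumptions that are not stated in the text: a new invocation enters every clock cycle, no stage stalls, and the pipeline depth is a hypothetical parameter. The function names are illustrative and are not part of the compiler.

```cpp
// Idealized model (an assumption, not compiler output): invocation k
// enters stage 1 during cycle k, and nothing ever stalls.

// Returns the 1-based stage that `invocation` occupies during `cycle`,
// or 0 if the invocation has not started yet or has already returned.
int stage_at(int invocation, int cycle, int depth) {
    int stage = cycle - invocation + 1;
    return (stage >= 1 && stage <= depth) ? stage : 0;
}

// Counts how many of `total` invocations are in flight during `cycle`.
int active_invocations(int cycle, int depth, int total) {
    int active = 0;
    for (int inv = 0; inv < total; ++inv) {
        if (stage_at(inv, cycle, depth) != 0) {
            ++active;
        }
    }
    return active;
}
```

With a depth of 4, stage_at(0, 3, 4) is 4 and stage_at(3, 3, 4) is 1: the first invocation is in its last stage during the same cycle in which the fourth invocation enters the pipeline.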
3. C Language and Library Support
3.1. Supported C and C++ Subset for Component Synthesis
Component and task functions are subject to restrictions on the following constructs:
- Dynamic memory allocation.
- Virtual functions.
- Function pointers.
- C++ or C library functions, except the supported math functions explicitly mentioned in Supported Math Functions.
- Non-static class functions.
- Template functions without an explicit specialization.
In addition, a component or task function cannot contain an irreducible loop. That is, loops in component and task functions must have only one entry point into the loop.
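As an illustration of the single-entry rule, the following sketch contrasts a loop that can be entered at two points with an equivalent loop that has one entry point. Both functions are hypothetical plain C++ written for this comparison; they are not taken from the compiler documentation.

```cpp
// Irreducible: the loop formed by `top` and `middle` can be entered at
// `top` (by falling through) or at `middle` (through the goto).
int sum_irreducible(int n, bool skip_first) {
    int i = 0;
    int sum = 0;
    if (skip_first) goto middle;  // second entry point into the loop
top:
    sum += 1;
middle:
    sum += i;
    ++i;
    if (i < n) goto top;
    return sum;
}

// Reducible rewrite: the same arithmetic with a single loop entry point.
// The extra entry is expressed as a condition inside the loop body.
int sum_reducible(int n, bool skip_first) {
    int i = 0;
    int sum = 0;
    do {
        if (!(skip_first && i == 0)) {
            sum += 1;
        }
        sum += i;
        ++i;
    } while (i < n);
    return sum;
}
```

The rewrite moves the decision about skipping the first increment into the loop body, so control can reach the loop only through its top.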
3.2. C and C++ Libraries
HLS Header File | Description |
---|---|
HLS/hls.h | Required for component identification and component parameter interfaces. |
HLS/math.h | Includes FPGA-specific definitions for the math functions from the math.h for your operating system. |
HLS/extendedmath.h | Includes additional FPGA-specific definitions of math functions not in math.h. |
HLS/ac_int.h | Provides FPGA-optimized arbitrary width integer support. |
HLS/ac_fixed.h | Provides FPGA-optimized arbitrary precision fixed point support. |
HLS/ac_fixed_math.h | Provides FPGA-optimized arbitrary precision fixed point math functions. |
HLS/ac_complex.h | Provides FPGA-optimized complex number support. |
HLS/hls_float.h | Provides FPGA-optimized arbitrary-precision IEEE-754 compliant floating-point number support. |
HLS/hls_float_math.h | Provides FPGA-optimized floating-point math functions. |
HLS/stdio.h | Provides printf support for components so that printf statements work in x86 emulations, but are disabled in components when compiling to an FPGA architecture. |
<iostream> | To use cout and cerr in your component, guard the statements with the HLS_SYNTHESIS macro. |
math.h
To access functions in math.h from your component function, include the "HLS/math.h" file in your source code. The header ensures that the components call the hardware versions of the math functions.
For more information about supported math.h functions, see Supported Math Functions.
stdio.h
Synthesized component functions generally do not support C and C++ standard library features such as FILE pointers.
A component can call printf by including the header file HLS/stdio.h. This header changes the behavior of printf depending on the compilation target:
- For compilation that targets the x86-64 architecture (that is, -march=x86-64 ), the printf call behaves as normal.
- For compilation that targets the FPGA architecture (that is, -march="<FPGA_family_or_part_number>"), the compiler removes the printf call.
If you use printf in a component function without first including the #include "HLS/stdio.h" line in your code, you get an error message similar to the following error when you compile hardware to the FPGA architecture:

    $ i++ -march="<FPGA_family_or_part_number>" --component dut test.cpp
    Error: HLS gen_qsys FAILED. See ./a.prj/dut.log for details.
You can use C and C++ standard library functions such as fopen and printf as normal in all testbench functions.
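A component that calls printf might look like the following sketch. The component name is hypothetical. The #else branch is an assumption added only so that the sketch also builds on a system where the HLS headers are not on the include path; in a normal flow you include HLS/hls.h and HLS/stdio.h directly.

```cpp
#ifdef __INTELFPGA_COMPILER__
#include "HLS/hls.h"
#include "HLS/stdio.h"
#else
// Stand-ins (assumption) for builds without the HLS headers installed.
#include <stdio.h>
#define component
#endif

// Hypothetical component: the printf call runs in x86-64 emulation and,
// per the description above, is removed for FPGA targets.
component int add_with_trace(int a, int b) {
    int sum = a + b;
    printf("add_with_trace: %d + %d = %d\n", a, b, sum);
    return sum;
}
```

Because the trace output disappears for FPGA targets, the component must not rely on printf for anything other than diagnostics.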
iostream
To use cout or cerr in a component function, guard the statements with the HLS_SYNTHESIS macro, as in the following example:

    #include "HLS/hls.h"
    #include <iostream>

    component int debug_component (int a) {
    #ifndef HLS_SYNTHESIS
      std::cout << "input value: " << a << std::endl;
    #endif
      return a;
    }

If you attempt to use cout or cerr in a component function without guarding the line in your code with the HLS_SYNTHESIS macro, you get an error message similar to the following error when you compile hardware to the FPGA architecture:

    $ i++ -march="<FPGA_family_or_part_number>" run.cpp
    run.cpp:5: Compiler Error: Cannot synthesize std::cout used inside of a component.
    HLS Main Optimizer FAILED.
3.3. Templated and Overloaded Functions
3.3.1. Templated Functions
Templated Functions as an HLS Component
When you create a template function, you must declare the instantiation of the function to synthesize into hardware.

    template <typename T, int MULT>
    T multadd (T a, T b) {
      return MULT * (a + b);
    }

    template component int multadd<int, 5>(int a, int b);

This declaration, combined with the earlier template definition, marks the int variant with MULT=5 of the multadd function to be generated into a component. This component can now be invoked from the testbench.
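A sketch of the template definition, its explicit instantiation, and a testbench call in one file follows. The check_multadd helper is hypothetical, and the #else branch is an assumption that lets the sketch build where the HLS headers are not installed.

```cpp
#ifdef __INTELFPGA_COMPILER__
#include "HLS/hls.h"
#else
#define component  // stand-in (assumption) for builds without the HLS headers
#endif

template <typename T, int MULT>
T multadd(T a, T b) {
    return MULT * (a + b);
}

// Explicit instantiation: marks the int variant with MULT=5 as a component.
template component int multadd<int, 5>(int a, int b);

// Hypothetical testbench helper that invokes the instantiated component.
bool check_multadd() {
    return multadd<int, 5>(2, 3) == 25;
}
```

Other instantiations of multadd, such as multadd<float, 2>, are ordinary template functions unless they are also declared with the component keyword.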
Templated Functions as an HLS Task
If you want to use the function as a task in a system of tasks, use the ihc::launch and ihc::collect calls as shown in the following example:

    component void foo () {
      int a, b;
      ihc::launch<multadd<int, 5>> (a, b);
      int res = ihc::collect<multadd<int, 5>>();
    }
3.3.2. Overloaded Functions
To overload a component function, define multiple variants of the function.

    component int mult (int a, int b) {
      return a * b;
    }

    component float mult (float a, float b) {
      return a * b;
    }
3.3.3. Function Name Mapping
A mapping of the full function declaration to the synthesized function name is provided in the summary page of the High-Level Design Reports (report.html). The synthesized function name is used for all the other reports such as the loops report and area analysis.
The following example shows this table in the report:
3.4. Intel HLS Compiler Pro Edition Compiler-Defined Preprocessor Macros
Tool Invocation | __INTELFPGA_COMPILER__ |
---|---|
g++ or cl | Undefined |
i++ -march=x86-64 | 2040 |
i++ -march="<FPGA_family_or_part_number>" | 2040 |
Tool Invocation | HLS_SYNTHESIS in Testbench Code | HLS_SYNTHESIS in HLS Component Code |
---|---|---|
g++ or cl | Undefined | Undefined |
i++ -march=x86-64 | Undefined | Undefined |
i++ -march="<FPGA_family_or_part_number>" | Undefined | Defined |
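The macros in the tables above can be tested with ordinary preprocessor conditionals. The following sketch uses two hypothetical helper functions; it only reports whether the macros are defined and makes no assumption about how the numeric value encodes the version.

```cpp
// Returns the value of __INTELFPGA_COMPILER__, or 0 when the macro is
// undefined (for example, when compiling with g++ or cl).
int intelfpga_compiler_value() {
#ifdef __INTELFPGA_COMPILER__
    return __INTELFPGA_COMPILER__;
#else
    return 0;
#endif
}

// Returns true only where HLS_SYNTHESIS is defined. According to the
// table above, that is HLS component code compiled for an FPGA target.
bool built_for_synthesis() {
#ifdef HLS_SYNTHESIS
    return true;
#else
    return false;
#endif
}
```

When built with g++ or cl, both helpers report that the macros are undefined, which matches the first row of each table.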
4. Component Interfaces
The component invocation interface is common to all HLS components and contains the return data (for non-void functions) and handshake signals for invoking the component, and for receiving results back when the component finishes executing.
Use the parameter interface to transfer data in and out of your component function. The parameter interface for your component is based on the parameters that you define in your component function signature and global variables (including global streams) that your component accesses.
4.1. Component Invocation Interface
- A call interface that consists of start and busy signals. The call interface is sometimes referred to as the do stream.
- A return interface that consists of done and stall signals. The return interface is sometimes referred to as the return stream.
- Return data if the component function has a return type that is not void.
Alternatively, by declaring your component as an hls_avalon_slave_component component, your component can have signals registered in the component slave memory map instead. In an hls_avalon_slave_component component, the start, done, and returndata signals appear in the component control and status registers (CSR) instead of as conduits outside of the component.
For a comparison of the invocation interfaces, see Interface Definition Example: Component Invocation Interface Control Attributes.
For an example of a component interface with scalar and pointer arguments, see Figure 2.
Interfaces and Generated RTL
Component Invocation Interface Control Attributes
You can indicate the control signals that correspond to the actions of calling your component by using one of the component invocation interface attributes.
Unless a component parameter is marked stable (with the hls_stable_argument attribute), the component parameter inputs are synchronized according to this component invocation protocol.
Control Attribute | Description |
---|---|
hls_avalon_streaming_component | This is the default component invocation interface. The component uses start, busy, stall, and done signals for handshaking. |
hls_avalon_slave_component | The start, done, and returndata (if applicable) signals appear in the component CSR instead of as conduits outside of the component. |
hls_always_run_component | The start signal is tied to 1 internally in the component. There is no done signal output. |
hls_stall_free_return | If the downstream component never stalls, the stall signal is removed by internally setting it to 0. |
4.1.1. Scalar Parameters
The inputs are read into the component when the external system pulls the start signal high and the component keeps the busy signal low.
If your component has a return value, an output conduit that is synchronized to your component done signal is generated.
For an example of how to specify a scalar parameter and how it is read in by a component, see the a argument in Figure 2 and Figure 3.
4.1.2. Pointer and Reference Parameters
You can customize these pointer interfaces using the mm_master<> class.
For details about Avalon® (MM) Master interfaces, see Avalon Memory-Mapped Master Interfaces.
4.1.3. Interface Definition Example: Component Invocation Interface Control Attributes
This example compares two simple components. One component is implemented with simple conduits as its signal interface, while the other component is implemented as an hls_avalon_slave_component.

    component float myComponent(float a, float b) {
      return a+b;
    }

This code example results in a component with the following signals:

    hls_avalon_slave_component
    component float myComponent(
        hls_avalon_slave_register_arg float a,
        hls_avalon_slave_register_arg float b) {
      return a+b;
    }

This code results in a component with the following signals:
4.1.4. Interface Definition Example: Component with Both Scalar and Pointer Arguments
component int dut(int a, int* b, int i) { return a*b[i]; }

If the dut component raises the busy signal, the caller needs to keep the start signal high and continue asserting the input arguments. Similarly, if the component downstream of dut raises the stall signal, then dut holds the done signal high until the stall signal is de-asserted.
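The call-side rule (inputs are read when start is high and busy is low) can be sketched as a small behavioral model. This is not RTL and not something the compiler generates; the CallCycle type and count_accepted function are hypothetical names used only for illustration.

```cpp
#include <vector>

// One clock cycle of the call interface, sampled at the component boundary.
struct CallCycle {
    bool start;  // driven by the caller
    bool busy;   // driven by the component
};

// Counts the cycles in which the component reads its inputs, that is,
// the cycles where start is high while busy is low.
int count_accepted(const std::vector<CallCycle>& trace) {
    int accepted = 0;
    for (const CallCycle& c : trace) {
        if (c.start && !c.busy) {
            ++accepted;
        }
    }
    return accepted;
}
```

In a trace where the caller holds start high for two busy cycles before busy drops, only the third cycle counts as an accepted invocation. The return side can be modeled the same way with the done and stall signals.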
4.2. Avalon Streaming Interfaces
A component can have input and output streams that conform to the Avalon Streaming (ST) interface specifications. These input and output streams are represented by passing references to ihc::stream_in<> and ihc::stream_out<> objects as function arguments to the component.
When you use an Avalon ST interface, you can serialize the data over several clock cycles. That is, one component invocation can read from a stream multiple times.
You cannot derive new classes from the stream classes or encapsulate them in other formats such as structs or arrays. However, you may use references to instances of these classes as references inside other classes, meaning that you can create a class that has a reference to a stream object as a data member.
A component can have multiple read sites for a stream. Similarly, a component can have multiple write sites for a stream. However, for best component performance try to restrict each input stream in your design to a single read site and each output stream to a single write site.
Streaming Input Interfaces
Template Object or Parameter | Description |
---|---|
ihc::stream_in | Streaming input interface to the component. |
ihc::buffer | Specifies the capacity (in words) of the FIFO buffer on the input data that associates with the stream. |
ihc::readyLatency | Specifies the number of cycles between when the ready signal is deasserted and when the input stream can no longer accept new inputs. |
ihc::bitsPerSymbol | Describes how the data is broken into symbols on the data bus. |
ihc::firstSymbolInHighOrderBits | Specifies whether the data symbols in the stream are in big endian order. |
ihc::usesPackets | Exposes the startofpacket and endofpacket sideband signals on the stream interface. |
ihc::usesEmpty | Exposes the empty out-of-band signal on the stream interface. |
ihc::usesValid | Controls whether a valid signal is present on the stream interface. |
Function API | Description |
---|---|
T read() | Blocking read call to be used from within the component |
T read(bool& sop, bool& eop) | Available only if usesPackets<true> is set. Blocking read with out-of-band startofpacket and endofpacket signals. |
T read(bool& sop, bool& eop, int& empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking read with out-of-band startofpacket, endofpacket, and empty signals. |
T tryRead(bool& success) | Non-blocking read call to be used from within the component. The success bool is set to true if the read was valid. That is, the Avalon®-ST valid signal was high when the component tried to read from the stream. The emulation model of tryRead() is not cycle-accurate, so the behavior of tryRead() might differ between emulation and simulation. |
T tryRead(bool& success, bool& sop, bool& eop) | Available only if usesPackets<true> is set. Non-blocking read with out-of-band startofpacket and endofpacket signals. |
T tryRead(bool& success, bool& sop, bool& eop, int& empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Non-blocking read with out-of-band startofpacket, endofpacket, and empty signals. |
void write(T data) | Blocking write call to be used from the testbench to populate the FIFO to be sent to the component. |
void write(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Blocking write call with out-of-band startofpacket and endofpacket signals. |
void write(T data, bool sop, bool eop, int empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking write call with out-of-band startofpacket, endofpacket, and empty signals. |
Streaming Output Interfaces
Template Object or Parameter | Description |
---|---|
ihc::stream_out | Streaming output interface from the component. |
ihc::readylatency | Specifies the number of cycles between when the ready signal is deasserted and when the input stream can no longer accept new inputs. |
ihc::bitsPerSymbol | Describes how the data is broken into symbols on the data bus. |
ihc::firstSymbolInHighOrderBits | Specifies whether the data symbols in the stream are in big endian order. |
ihc::usesPackets | Exposes the startofpacket and endofpacket sideband signals on the stream interface. |
ihc::usesEmpty | Exposes the empty out-of-band signal on the stream interface. |
ihc::usesReady | Controls whether a ready signal is present. |
Function API | Description |
---|---|
void write(T data) | Blocking write call from the component |
void write(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Blocking write with out-of-band startofpacket and endofpacket signals. |
void write(T data, bool sop, bool eop, int empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking write with out-of-band startofpacket, endofpacket, and empty signals. |
bool tryWrite(T data) | Non-blocking write call from the component. The return value represents whether the write was successful. |
bool tryWrite(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Non-blocking write with out-of-band startofpacket and endofpacket signals. The return value represents whether the write was successful. That is, the downstream interface was pulling the ready signal high while the HLS component tried to write to the stream. |
bool tryWrite(T data, bool sop, bool eop, int empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Non-blocking write with out-of-band startofpacket, endofpacket, and empty signals. The return value represents whether the write was successful. |
T read() | Blocking read call to be used from the testbench to read back the data from the component |
T read(bool &sop, bool &eop) | Available only if usesPackets<true> is set. Blocking read call to be used from the testbench to read back the data from the component with out-of-band startofpacket and endofpacket signals. |
T read(bool &sop, bool &eop, int &empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking read call to be used from the testbench to read back the data from the component with out-of-band startofpacket, endofpacket, and empty signals. |
4.3. Pipes
Using global memory to move data in and out of object libraries or tasks can constrain the performance of your design. Pipes provide a mechanism for passing data with high efficiency and low latency by using on-device FIFO buffers to communicate.
Pipes are similar to streams, but are simpler. When compared with streams, a pipe supports only ready/valid and data signals. You can use pipes in a component or between tasks.
Unlike streams, you can have an array of pipes and you can iterate over the array to write or read many pipes in a design.
Data written to a pipe remains in the pipe until it is read or until the component is reset.
The memory model of pipes allows you to use them to send and receive data from running functions in an object library or a running task function.
Pipe Properties
- FIFO ordering
- Data is accessible (readable) only in FIFO order. There is no concept of a memory address or pointer in a pipe, which means random data access is not possible with pipes.
- Capacity
- Pipes have a capacity. That is, a fixed amount of data can be written to an initially empty pipe before needing to read anything from it to make room for more data.
A full pipe applies back pressure to the write site.
Pipe Accessors
Data is written to a pipe through an API that commits a single word of the pipe data type, and that word is later returned by an API that reads data from the pipe.
- Blocking API calls
- Blocking write API calls wait until the pipe has enough capacity to commit data.
Blocking read API calls wait until the pipe contains data to be read.
- Nonblocking API calls
- Nonblocking API calls execute immediately and return a status that indicates whether the operation was successful.
You can mix blocking and nonblocking accesses to the same pipe. For example, you can write data to a pipe with a blocking write() call and read it from the other end with a nonblocking read() call, or the reverse.
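The capacity and nonblocking-access behavior described above can be pictured with a small software model. The following is a minimal sketch in plain C++, not the ihc::pipe implementation: ModelPipe and its Capacity parameter are hypothetical names, and the status-reporting signatures only imitate the nonblocking forms listed in "The pipe Class and Its Use". Blocking accesses are not modeled because a single-threaded model has nothing to wait on.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical software model of a pipe: a bounded FIFO whose accesses
// report success instead of waiting. This is NOT the ihc::pipe class.
template <class T, std::size_t Capacity>
class ModelPipe {
public:
  // Nonblocking write: commits one word only if the FIFO has room.
  void write(const T &data, bool &success) {
    success = fifo_.size() < Capacity;
    if (success) {
      fifo_.push_back(data);
    }
  }

  // Nonblocking read: returns the oldest word if one is available.
  // On failure, a value-initialized T is returned and must be ignored.
  T read(bool &success) {
    success = !fifo_.empty();
    if (!success) {
      return T{};
    }
    T front = fifo_.front();
    fifo_.pop_front();
    return front;
  }

private:
  std::deque<T> fifo_; // words come out in the order they went in
};
```

In this model, a write to a full pipe fails instead of applying back pressure, and a read from an empty pipe fails instead of waiting. A real pipe might also hold more words than the requested minimum capacity, which the model does not reflect.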
Data Persistence in Pipes
Data written to a pipe by a component remains in the pipe until it is read or until the component is reset.
Data written to a pipe by a task remains in the pipe until another task reads from the pipe or the component containing the tasks is reset.
The sequence of data in a pipe always follows FIFO ordering.
Data in pipes does not persist across FPGA device resets or reprogramming.
Restrictions on using Pipes
- Multiple pipe call sites
- A component or task function can read from the same pipe multiple times, but multiple component or task functions cannot read from the same pipe. Similarly, a component or task function can write to the same pipe multiple times, but multiple component or task functions cannot write to the same pipe.
- Feedback and feed-forward pipes
- A component or task function should use separate pipes for pipe reads and pipe writes. Writing to and reading from the same pipe within the same function can lead to poor performance.
- Pipe accesses in loops
- Do not use nonblocking pipe accesses in a loop structure that waits for data from the pipe. That is, avoid the following code pattern for nonblocking pipe accesses:
bool success = false;
while (!success) {
  my_pipe::write(rd_src_buf[i], success); // can also be a nonblocking read
}
Use a blocking access in this code pattern instead, because a blocking access is more efficient in hardware than a nonblocking access that is retried in a loop.
4.3.1. The pipe Class and Its Use
The pipe class exposes static methods for writing a data word to a pipe and reading a data word from a pipe. The reads and writes can be blocking or nonblocking, with the form chosen based on the overload resolution.
template <class name, class dataT, size_t min_capacity = 0>
class pipe {
public:
  // Blocking
  static dataT read();
  static void write(dataT data);
  // Non-blocking
  static dataT read(bool &success);
  static void write(dataT data, bool &success);
};
Parameter | Description |
---|---|
name | The type that is used to create a unique identifier for the pipe. It is typically a user-defined class, in a user namespace. Forward declaration of the type is enough, and the type need not be defined. |
dataT | The data type of the packet contained within a pipe. This is the data type that is read during a successful pipe read() operation, or written during a successful pipe write() operation. The type must have a standard layout and be trivially copyable. |
min_capacity | The minimum number of words (in units of dataT) that the pipe must be able to store without any being read out. The compiler might create a pipe with a larger capacity due to performance considerations. |
Example Using Pipes
#include "HLS/hls.h"
template<unsigned ID, class T, unsigned pipe_capacity>
class TaskSystem {
private:
  template<unsigned SystemID> class InputPipeID {};
  template<unsigned SystemID> class TaskPipeID {};
  template<unsigned SystemID> class OutputPipeID {};
public:
  using input_pipe = ihc::pipe<class InputPipeID<ID>, T, pipe_capacity>;
  using output_pipe = ihc::pipe<class OutputPipeID<ID>, T, pipe_capacity>;
  using task_pipe = ihc::pipe<class TaskPipeID<ID>, T, pipe_capacity>;
  static void first_task(unsigned N) {
    T data;
    for(unsigned i=0; i<N; ++i) {
      data = input_pipe::read();
      task_pipe::write(data);
    }
  }
  static void second_task(unsigned N) {
    T data;
    for(unsigned i=0; i<N; ++i) {
      data = task_pipe::read();
      output_pipe::write(data);
    }
  }
};
With this header file, first_task and second_task can be called from separate task functions to achieve concurrency.
#include "HLS/hls.h"
#include <iostream>
#include "task_system.h"
unsigned constexpr ID = 42; // can be any unique value
unsigned constexpr CAPACITY = 100;
using MySystem = TaskSystem<ID, int, CAPACITY>;
int main() {
  ihc::launch<MySystem::first_task>(CAPACITY);
  ihc::launch<MySystem::second_task>(CAPACITY);
  for(int i = 0; i < CAPACITY; ++i) {
    std::cout << "input: " << i << "\n";
    MySystem::input_pipe::write(i);
  }
  for(int i = 0; i < CAPACITY; ++i) {
    int data = MySystem::output_pipe::read();
    std::cout << "output: " << data << "\n";
  }
  return 0;
}
Example of an Array of Pipes
The following code example implements an array of pipes using templates, and includes functions to write to such an array.
#include "HLS/hls.h"
// PipeArray
template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize>
class PipeArray {
private:
  template <unsigned idx> struct StructIndex;
  template <unsigned idx> struct VerifyIndex {
    static_assert(idx < arraySize, "Index out of bounds");
    using VerifiedPipe = ihc::pipe<StructIndex<idx>, T, pipeCapacity>;
  };
public:
  template <unsigned idx>
  using pipe_at = typename VerifyIndex<idx>::VerifiedPipe;
  static void write_to_pipes(T *values);
};
// Write Unroller
template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize, unsigned idx>
struct WriteUnroller {
  using my_array = PipeArray<ArrayID, T, pipeCapacity, arraySize>;
  static void write_to_pipes_impl(T *values) {
    my_array::template pipe_at<idx>::write(values[idx]);
    WriteUnroller<ArrayID, T, pipeCapacity, arraySize, idx + 1>::write_to_pipes_impl(values);
  }
};
template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize>
struct WriteUnroller<ArrayID, T, pipeCapacity, arraySize, arraySize> {
  static void write_to_pipes_impl(T *values) {}
};
// Write function
template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize>
void PipeArray<ArrayID, T, pipeCapacity, arraySize>::write_to_pipes(T *values) {
  WriteUnroller<ArrayID, T, pipeCapacity, arraySize, 0>::write_to_pipes_impl(values);
}
The function PipeArray::write_to_pipes takes an array of values to be written, and calls WriteUnroller::write_to_pipes_impl, which uses recursive templating to write to each pipe in the array. Reading from the array of pipes would have a similar implementation.
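As an illustration of what the read side might look like, the following sketch applies the same recursive-template technique to reads. So that it compiles outside the Intel® HLS Compiler, it substitutes a hypothetical model::pipe stand-in (an unbounded, nonblocking software queue) for ihc::pipe. The names model::pipe, ReadUnroller, read_from_pipes, DemoArrayID, and DemoArray are assumptions made for this sketch and are not part of the HLS API.

```cpp
#include <cstddef>
#include <queue>

namespace model {
// Hypothetical stand-in for ihc::pipe so the sketch builds with any C++
// compiler. It does not block and ignores min_capacity; read() must only be
// called after a matching write().
template <class name, class dataT, std::size_t min_capacity = 0>
class pipe {
  static std::queue<dataT> &fifo() {
    static std::queue<dataT> q; // one queue per distinct pipe type
    return q;
  }

public:
  static void write(dataT data) { fifo().push(data); }
  static dataT read() {
    dataT d = fifo().front();
    fifo().pop();
    return d;
  }
};
} // namespace model

// Abbreviated copy of the PipeArray idea with a read entry point.
template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize>
class PipeArray {
private:
  template <unsigned idx> struct StructIndex;
  template <unsigned idx> struct VerifyIndex {
    static_assert(idx < arraySize, "Index out of bounds");
    using VerifiedPipe = model::pipe<StructIndex<idx>, T, pipeCapacity>;
  };

public:
  template <unsigned idx>
  using pipe_at = typename VerifyIndex<idx>::VerifiedPipe;
  static void read_from_pipes(T *values);
};

// Read unroller: reads pipe idx into values[idx], then recurses to idx + 1.
template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize,
          unsigned idx>
struct ReadUnroller {
  using my_array = PipeArray<ArrayID, T, pipeCapacity, arraySize>;
  using this_pipe = typename my_array::template pipe_at<idx>;
  static void read_from_pipes_impl(T *values) {
    values[idx] = this_pipe::read();
    ReadUnroller<ArrayID, T, pipeCapacity, arraySize,
                 idx + 1>::read_from_pipes_impl(values);
  }
};

// Recursion ends when idx reaches arraySize.
template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize>
struct ReadUnroller<ArrayID, T, pipeCapacity, arraySize, arraySize> {
  static void read_from_pipes_impl(T *) {}
};

template <class ArrayID, typename T, unsigned pipeCapacity, unsigned arraySize>
void PipeArray<ArrayID, T, pipeCapacity, arraySize>::read_from_pipes(T *values) {
  ReadUnroller<ArrayID, T, pipeCapacity, arraySize, 0>::read_from_pipes_impl(values);
}

// Example instantiation: an array of three int pipes.
struct DemoArrayID;
using DemoArray = PipeArray<DemoArrayID, int, 4, 3>;
```

In a real component, replacing model::pipe with ihc::pipe should give blocking reads on each pipe in index order, under the assumption that the rest of the PipeArray definition matches the write example.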
4.4. Avalon Memory-Mapped Master Interfaces
Each mm_master argument of a component results in an input conduit for the address. That input conduit is associated with the component start and busy signals. In addition to this input conduit, a unique Avalon® MM Master interface is created for each address space. Master interfaces that share the same address space are arbitrated on the same interface.
For more information about Avalon® MM Master interfaces, refer to "Avalon Memory-Mapped Interfaces" in Avalon Interface Specifications.
Template Object or Parameter | Description |
---|---|
ihc::mm_master | The underlying pointer type. |
ihc::dwidth | The width of the memory-mapped data bus in bits. |
ihc::awidth | The width of the memory-mapped address bus in bits. |
ihc::aspace | The address space of the interface that associates with the master. |
ihc::latency | The guaranteed latency from when a read command exits the component to when the external memory returns valid read data. |
ihc::maxburst | The maximum number of data transfers that can associate with a read or write transaction. |
ihc::align | The alignment of the base pointer address in bytes. |
ihc::readwrite_mode | The port direction of the interface. |
ihc::waitrequest | Adds the waitrequest signal that is asserted by the slave when it is unable to respond to a read or write request. |
getInterfaceAtIndex | This testbench function is used to index into an mm_master object. |
4.4.1. Memory-Mapped Master Testbench Constructor
To create an mm_master<> object, add the following constructor in your code:
ihc::mm_master<int, … > mm(void* ptr, int size, bool use_socket=false);
where the constructor arguments are as follows:
- ptr is the underlying pointer to the memory in the testbench
- size is the total size of the buffer in bytes
- use_socket is the option you use to override the copying of the memory buffer and have all the memory accesses pass back to the testbench memory
By default, the Intel® HLS Compiler copies the memory buffer over to the simulator and then copies it back after the component has run. In some cases, such as pointer-chasing in linked lists, copying the memory buffer back and forth is undesirable. You can override this behavior by setting use_socket to true.
Note: When you set use_socket to true, only Avalon® MM Master interfaces with 64-bit wide addresses are supported. In addition, setting this option increases the run time of the simulation.
4.4.2. Implicit and Explicit Examples of Describing a Memory Interface
Implicit Example
The following code example arbitrates the load and store instructions from both pointer dereferences to a single interface on the component's top-level module. This interface will have a data bus width of 64 bits, an address width of 64 bits, and a fixed latency of 1.
#include "HLS/hls.h"
component void dut(int *ptr1, int *ptr2) {
  *ptr1 += *ptr2;
  *ptr2 += ptr1[1];
}
int main(void) {
  int x[2] = {0, 1};
  int y = 2;
  dut(x, &y);
  return 0;
}
Explicit Example
This example demonstrates how to optimize the previous code snippet for a specific memory interface using the explicit mm_master class. The mm_master class has a defined template, and it has the following characteristics:
- Each interface is given a unique ID that infers two independent interfaces and reduces the amount of arbitration within the component.
- The data bus width is larger than the default width of 64 bits.
- The address bus width is smaller than the default width of 64 bits.
- The interfaces have a fixed latency of 2.
By defining these characteristics, you state that your system returns valid read data after exactly two clock cycles and that the interface never stalls for either reads or writes, but the system must be able to provide two different memories. A unique physical Avalon® MM master port (as specified by the aspace parameter) is expected to correspond to a unique physical memory. If you connect multiple Avalon® MM Master interfaces with different physical Avalon® MM master ports to the same physical memory, the Intel® HLS Compiler cannot ensure functional correctness for any memory dependencies.
#include "HLS/hls.h"
typedef ihc::mm_master<int, ihc::dwidth<256>,
                            ihc::awidth<32>,
                            ihc::aspace<1>,
                            ihc::latency<2> > Master1;
typedef ihc::mm_master<int, ihc::dwidth<256>,
                            ihc::awidth<32>,
                            ihc::aspace<4>,
                            ihc::latency<2> > Master2;
component void dut(Master1 &mm1, Master2 &mm2) {
  *mm1 += *mm2;
  *mm2 += mm1[1];
}
int main(void) {
  int x[2] = {0, 1};
  int y = 2;
  Master1 mm_x(x, 2*sizeof(int), false);
  Master2 mm_y(&y, sizeof(int), false);
  dut(mm_x, mm_y);
  return 0;
}
4.4.3. Avalon Memory-Mapped Master Interfaces and Load-Store Units
When your component uses one or more Avalon® Memory-Mapped (MM) Master interfaces, the Intel® HLS Compiler inserts load-store units (LSUs) in the datapath between the interface and the rest of your component datapath. The type of LSU inserted depends on the inferred memory access pattern and other memory attributes.
The Intel® HLS Compiler also tries to minimize the number of LSUs created by coalescing multiple load/store operations into wider load/store operations. Multiple LSUs can share a memory interface.
Typically, the Intel® HLS Compiler creates burst-coalesced LSUs for variable-latency MM Master interfaces and pipelined LSUs for fixed-latency MM Master interfaces.
For details about the types of the LSUs and when the Intel® HLS Compiler typically instantiates them, see Load-Store Unit Types and Memory-Access Coalescing and Load-Store Units.
If your design contains one or more variable-latency Avalon® MM Master interfaces (for example, if you interface with off-chip memory), you can control the LSU type to improve the performance and resource utilization of your design.
Use the high-level design reports to determine what types of LSUs your component has, and then you can apply these LSU controls as needed to achieve the component performance that you want.
Template Object/Parameter/Function | Description |
---|---|
ihc::lsu | The underlying LSU class template object |
ihc::style | Specifies the type of load-store unit. |
ihc::static_coalescing | Explicitly allows or prevents static coalescing of a load/store operation with other load/store operations. |
load | Loads data from memory into the LSU. |
store | Stores data from the LSU into memory. |
4.4.3.1. Load-Store Unit Types
The Intel® HLS Compiler determines the types of load-store units (LSUs) to instantiate and whether to coalesce memory accesses based on the memory access pattern that the compiler infers.
- Burst-coalesced LSUs
- Nonaligned burst-coalesced LSUs
- The Intel® HLS Compiler typically instantiates burst-coalesced LSUs for accessing variable-latency Avalon® MM Master interfaces.
- Pipelined LSUs
- Never-stall pipelined LSUs
- The Intel® HLS Compiler typically instantiates pipelined LSUs for accessing fixed-latency Avalon® MM Master interfaces.
Click LSUs in the Graph Viewer (in the High-Level Design Reports) to see which types of LSU the compiler instantiated for your component.

Burst-Coalesced Load-Store Units
By default, the compiler infers burst-coalesced load-store units (LSUs) for any variable-latency Avalon® MM Master interface.
A burst-coalesced LSU dynamically buffers contiguous memory requests until the largest possible burst can be made. The largest possible burst is defined by the ihc::maxburst parameter. For noncontiguous memory requests, a burst-coalesced LSU flushes the buffer between requests.
Burst-coalesced LSUs provide efficient, variable-latency access to memories outside of your component. However, they require a considerable amount of FPGA resources.
#include "HLS/hls.h"
component void
burst_coalesced(ihc::mm_master<int, ihc::dwidth<64>, ihc::awidth<32>,
                               ihc::aspace<1>, ihc::latency<0>> &in,
                ihc::mm_master<int, ihc::dwidth<64>, ihc::awidth<32>,
                               ihc::aspace<2>, ihc::latency<0>> &out,
                int i) {
  int value = in[i / 2]; // Burst-coalesced LSU
  out[i] = value;        // Burst-coalesced LSU
}
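The dynamic buffering that a burst-coalesced LSU performs can be pictured with a simplified behavioral model. The sketch below is plain C++ and only an approximation under stated assumptions: requests arrive as word addresses, a request is contiguous when its address is exactly one more than the previous one, and the buffer is flushed either on a noncontiguous request or when the burst reaches maxburst words. The function name coalesce_bursts is hypothetical, and real LSU hardware might flush on other conditions that are not modeled here.

```cpp
#include <cstddef>
#include <vector>

// Behavioral model only: groups a sequence of word addresses into bursts and
// returns the length of each burst. Assumes maxburst >= 1.
std::vector<std::size_t>
coalesce_bursts(const std::vector<std::size_t> &word_addrs,
                std::size_t maxburst) {
  std::vector<std::size_t> burst_lengths;
  std::size_t current = 0; // words buffered for the burst being built
  for (std::size_t i = 0; i < word_addrs.size(); ++i) {
    const bool contiguous =
        i > 0 && word_addrs[i] == word_addrs[i - 1] + 1;
    // Flush when the new request cannot extend the buffered burst.
    if (current > 0 && (!contiguous || current == maxburst)) {
      burst_lengths.push_back(current);
      current = 0;
    }
    ++current;
  }
  if (current > 0) {
    burst_lengths.push_back(current); // flush whatever remains
  }
  return burst_lengths;
}
```

With maxburst set to 4, the address sequence 0, 1, 2, 3, 4, 10, 11 yields bursts of 4, 1, and 2 words in this model: the first burst is capped by maxburst, and the jump from address 4 to address 10 forces a flush.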
Depending on the memory access pattern and other attributes, the compiler might modify a burst-coalesced LSU to be a nonaligned burst-coalesced LSU.
Nonaligned Burst-coalesced LSUs
When a burst-coalesced LSU can access a memory that is not aligned to the external memory word size, the Intel® HLS Compiler creates a nonaligned burst-coalesced LSU. Nonaligned LSUs typically require more FPGA resources to implement than aligned LSUs. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests.
#include "HLS/hls.h"
struct State {
  int x;
  int y;
  int z;
};
component void
static_coalescing(ihc::mm_master<State, ihc::dwidth<128>, ihc::awidth<32>,
                                 ihc::aspace<1>, ihc::latency<0>> &in,
                  ihc::mm_master<State, ihc::dwidth<128>, ihc::awidth<32>,
                                 ihc::aspace<2>, ihc::latency<0>> &out,
                  int i) {
  out[i] = in[i]; // Two Nonaligned Burst-coalesced LSUs
}
The figure that follows (Figure 5) shows the external memory contents for the previous code example and the nonaligned burst-coalesced LSUs in the component pipeline.
The data type that is read and written is a 96-bit-wide struct. The external memory width is 128 bits. This difference between the read/write data width and the external memory width forces some of the memory requests to span two consecutive memory words.
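The arithmetic behind that statement can be checked with a short calculation. The sketch below is plain C++ and assumes that the State elements are packed at a 12-byte (96-bit) stride and that the array starts on a 16-byte (128-bit) memory-word boundary. The names WordRange, words_touched, and spans_two_words are hypothetical helpers introduced for this illustration.

```cpp
#include <cstddef>

// First and last memory word touched by one element access.
struct WordRange {
  std::size_t first;
  std::size_t last;
};

// Maps element `index` to the memory words that its bytes occupy, assuming
// tightly packed elements and a base address aligned to the memory word.
constexpr WordRange words_touched(std::size_t index, std::size_t elem_bytes,
                                  std::size_t word_bytes) {
  return WordRange{(index * elem_bytes) / word_bytes,
                   (index * elem_bytes + elem_bytes - 1) / word_bytes};
}

// True when the access crosses a memory-word boundary. The defaults match
// the example: 96-bit elements on a 128-bit-wide memory.
constexpr bool spans_two_words(std::size_t index, std::size_t elem_bytes = 12,
                               std::size_t word_bytes = 16) {
  return words_touched(index, elem_bytes, word_bytes).first !=
         words_touched(index, elem_bytes, word_bytes).last;
}
```

Under these assumptions, elements 0 and 3 each fit in a single memory word, while elements 1 and 2 straddle a boundary. The pattern repeats every four elements, so half of the requests span two consecutive memory words.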
Pipelined Load-Store Units
By default, the compiler infers pipelined load-store units (LSUs) for any fixed-latency Avalon® MM Master interface and for on-device memories.
In a pipelined LSU, requests are submitted when they are received and no buffering occurs. Pipelined LSUs are also used for accessing memories inside your component.
You can tell the compiler to instantiate pipelined LSUs for variable-latency MM Master interfaces. However, variable-latency interface access with pipelined LSUs might reduce throughput because pipelined LSUs do not combine sequential memory requests into bursts.
Memory accesses are pipelined, so multiple requests can be in flight at the same time.
#include "HLS/hls.h"
component void
pipelined(ihc::mm_master<int, ihc::dwidth<64>, ihc::awidth<32>,
                         ihc::aspace<1>, ihc::latency<2>> &in,
          ihc::mm_master<int, ihc::dwidth<64>, ihc::awidth<32>,
                         ihc::aspace<1>, ihc::latency<2>> &out,
          int gi, int li) {
  int lmem[1024];
  int res = in[gi]; // Pipelined LSU
  for (int i = 0; i < 4; i++) {
    lmem[li - i] = res; // Pipelined LSU
    res >>= 1;
  }
  res = 0;
  for (int i = 0; i < 4; i++) {
    res ^= lmem[li - i]; // Pipelined LSU
  }
  out[gi] = res; // Pipelined LSU
}
Never-Stall Pipelined LSUs
If a pipelined LSU is connected to a memory inside the component or to a fixed-latency MM Master interface without arbitration, a never-stall LSU is created because all accesses to the memory take a fixed number of cycles that are known to the compiler.
#include "HLS/hls.h"
component void
neverstall(ihc::mm_master<int, ihc::dwidth<128>, ihc::awidth<32>,
                          ihc::aspace<1>, ihc::latency<0>> &in,
           ihc::mm_master<int, ihc::dwidth<128>, ihc::awidth<32>,
                          ihc::aspace<1>, ihc::latency<0>> &out,
           int gi, int li) {
  int lmem[1024];
  for (int i = 0; i < 1024; i++)
    lmem[i] = in[i];                 // Pipelined never-stall LSU
  out[gi] = lmem[li] ^ lmem[li + 1]; // Pipelined never-stall LSU
}
4.4.3.2. Memory-Access Coalescing and Load-Store Units
The Intel® HLS Compiler sometimes coalesces multiple memory accesses into a wider memory access to save on the number of LSUs instantiated.
When the compiler coalesces the memory accesses, it is referred to as static coalescing because the coalescing occurs at compile time. This static coalescing contrasts with the dynamic coalescing done by a burst-coalesced LSU.
The compiler typically attempts static coalescing when it detects multiple memory operations that access consecutive locations in memory. This coalescing is usually beneficial because it reduces the number of LSUs that compete for a shared memory interface.
The compiler coalesces memory accesses only up to the width of the memory interface that is being accessed. For an external memory interface, the maximum width is predetermined by the properties of the external memory that you are accessing. For a component (internal) memory interface, the maximum width can be set by the compiler based on the memory geometry that the compiler creates. For more details about component memories, see Component Memories (Memory Attributes).
For the following code example, the Intel® HLS Compiler statically coalesces the four 4-byte-wide load operations into one 16-byte-wide load operation. A similar coalescing is done for the four store operations. Coalescing the load and store operations reduces the number of required accesses to the Avalon MM Master interfaces by a factor of 4.
#include "HLS/hls.h"
component void
static_coalescing(ihc::mm_master<int, ihc::dwidth<128>, ihc::awidth<32>,
                                 ihc::aspace<1>, ihc::latency<0>> &in,
                  ihc::mm_master<int, ihc::dwidth<128>, ihc::awidth<32>,
                                 ihc::aspace<2>, ihc::latency<0>> &out,
                  int i) {
  // Four loads statically coalesced into one 16-byte-wide load
  int a1 = in[3 * i + 0];
  int a2 = in[3 * i + 1];
  int a3 = in[3 * i + 2];
  int a4 = in[3 * i + 3];
  // Four stores statically coalesced into one 16-byte-wide store
  out[3 * i + 0] = a4;
  out[3 * i + 1] = a3;
  out[3 * i + 2] = a2;
  out[3 * i + 3] = a1;
}

4.5. Slave Interfaces
Slave interfaces are implemented as Avalon® Memory Mapped (Avalon® MM) Slave interfaces. For details about the Avalon® MM Slave interfaces, see "Avalon Memory-Mapped Interfaces" in Avalon Interface Specifications.
Slave Type | Associated Slave Interface | Read/Write Behavior | Synchronization | Read Latency | Controlling Interface Data Width |
---|---|---|---|---|---|
Register | The component CSR slave. | The component cannot update these registers from the datapath, so you can read back only data that you wrote in. | Synchronized with the component start signal. | Fixed value of 1. | Always 64 bits |
Memory (M20K/MLAB) | Dedicated slave interface on the component. | The component reads from this memory and updates it as it runs. Updates from the component datapath are visible in memory. | Reads and writes to slave memories from outside of the component should occur only when your component is not executing. You might experience undefined component behavior if outside slave memory accesses occur when your component is executing. The undefined behavior can occur even if a slave memory access is to a memory address that the component does not access. | Fixed value that is dependent on the component memory access pattern and any attributes or pragmas that you set. See the Function Viewer report in the High-Level Design Report (report.html) for the read latency of a specific slave memory argument. | The data width is a multiple of the slave data type, where the multiple is determined by coalescing the internal accesses. |
4.5.1. Control and Status Register (CSR) Slave
Any parameters that are labeled as hls_avalon_slave_register_argument are located in this memory space. The resulting memory map is described in the automatically generated header file <results>.prj/components/<component_name>_csr.h. This file also provides the C macros for a master component to interact with the slave component. Examples of master components include Nios® II soft processors and Intel® Acceleration Stack host applications.
The control and status registers (that is, function call and return) of a component with the hls_avalon_slave_component attribute are implemented in the CSR slave interface.
You do not need to use the hls_avalon_slave_component attribute to use the hls_avalon_slave_register_argument attribute.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/interfaces/mm_slaves
Example code of a component with a CSR slave:
#include "HLS/hls.h"
struct MyStruct {
  int f;
  double j;
  short k;
};
hls_avalon_slave_component
component MyStruct mycomp_xyz(hls_avalon_slave_register_argument int y,
                              hls_avalon_slave_register_argument MyStruct struct_argument,
                              hls_avalon_slave_register_argument unsigned long long mylong,
                              hls_avalon_slave_register_argument char char_arg
                              ) {
  return struct_argument;
}
Generated C header file for the component mycomp_xyz:
/* This header file describes the CSR Slave for the mycomp_xyz component */
#ifndef __MYCOMP_XYZ_CSR_REGS_H__
#define __MYCOMP_XYZ_CSR_REGS_H__
/******************************************************************************/
/* Memory Map Summary                                                         */
/******************************************************************************/
/*
 Register   | Access  | Register Contents          | Description
 Address    |         | (64-bits)                  |
------------|---------|----------------------------|-----------------------------
 0x0        | R       | {reserved[62:0],           | Read the busy status of
            |         | busy[0:0]}                 | the component
            |         |                            | 0 - the component is ready
            |         |                            | to accept a new start
            |         |                            | 1 - the component cannot
            |         |                            | accept a new start
------------|---------|----------------------------|-----------------------------
 0x8        | W       | {reserved[62:0],           | Write 1 to signal start to
            |         | start[0:0]}                | the component
------------|---------|----------------------------|-----------------------------
 0x10       | R/W     | {reserved[62:0],           | 0 - Disable interrupt,
            |         | interrupt_enable[0:0]}     | 1 - Enable interrupt
------------|---------|----------------------------|-----------------------------
 0x18       | R/Wclr  | {reserved[61:0],           | Signals component completion
            |         | done[0:0],                 | done is read-only and
            |         | interrupt_status[0:0]}     | interrupt_status is write 1
            |         |                            | to clear
------------|---------|----------------------------|-----------------------------
 0x20       | R       | {returndata[63:0]}         | Return data (0 of 3)
------------|---------|----------------------------|-----------------------------
 0x28       | R       | {returndata[127:64]}       | Return data (1 of 3)
------------|---------|----------------------------|-----------------------------
 0x30       | R       | {returndata[191:128]}      | Return data (2 of 3)
------------|---------|----------------------------|-----------------------------
 0x38       | R/W     | {reserved[31:0],           | Argument y
            |         | y[31:0]}                   |
------------|---------|----------------------------|-----------------------------
 0x40       | R/W     | {struct_argument[63:0]}    | Argument struct_argument (0 of 3)
------------|---------|----------------------------|-----------------------------
 0x48       | R/W     | {struct_argument[127:64]}  | Argument struct_argument (1 of 3)
------------|---------|----------------------------|-----------------------------
 0x50       | R/W     | {struct_argument[191:128]} | Argument struct_argument (2 of 3)
------------|---------|----------------------------|-----------------------------
 0x58       | R/W     | {mylong[63:0]}             | Argument mylong
------------|---------|----------------------------|-----------------------------
 0x60       | R/W     | {reserved[55:0],           | Argument char_arg
            |         | char_arg[7:0]}             |
NOTE: Writes to reserved bits will be ignored and reads from reserved
      bits will return undefined values.
*/
/******************************************************************************/
/* Register Address Macros                                                    */
/******************************************************************************/
/* Byte Addresses */
#define MYCOMP_XYZ_CSR_BUSY_REG (0x0)
#define MYCOMP_XYZ_CSR_START_REG (0x8)
#define MYCOMP_XYZ_CSR_INTERRUPT_ENABLE_REG (0x10)
#define MYCOMP_XYZ_CSR_INTERRUPT_STATUS_REG (0x18)
#define MYCOMP_XYZ_CSR_RETURNDATA_0_REG (0x20)
#define MYCOMP_XYZ_CSR_RETURNDATA_1_REG (0x28)
#define MYCOMP_XYZ_CSR_RETURNDATA_2_REG (0x30)
#define MYCOMP_XYZ_CSR_ARG_Y_REG (0x38)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_0_REG (0x40)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_1_REG (0x48)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_2_REG (0x50)
#define MYCOMP_XYZ_CSR_ARG_MYLONG_REG (0x58)
#define MYCOMP_XYZ_CSR_ARG_CHAR_ARG_REG (0x60)
/* Argument Sizes (bytes) */
#define MYCOMP_XYZ_CSR_RETURNDATA_0_SIZE (8)
#define MYCOMP_XYZ_CSR_RETURNDATA_1_SIZE (8)
#define MYCOMP_XYZ_CSR_RETURNDATA_2_SIZE (8)
#define MYCOMP_XYZ_CSR_ARG_Y_SIZE (4)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_0_SIZE (8)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_1_SIZE (8)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_2_SIZE (8)
#define MYCOMP_XYZ_CSR_ARG_MYLONG_SIZE (8)
#define MYCOMP_XYZ_CSR_ARG_CHAR_ARG_SIZE (1)
/* Argument Masks */
#define MYCOMP_XYZ_CSR_RETURNDATA_0_MASK (0xffffffffffffffffULL)
#define MYCOMP_XYZ_CSR_RETURNDATA_1_MASK (0xffffffffffffffffULL)
#define MYCOMP_XYZ_CSR_RETURNDATA_2_MASK (0xffffffffffffffffULL)
#define MYCOMP_XYZ_CSR_ARG_Y_MASK (0xffffffff)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_0_MASK (0xffffffffffffffffULL)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_1_MASK (0xffffffffffffffffULL)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_2_MASK (0xffffffffffffffffULL)
#define MYCOMP_XYZ_CSR_ARG_MYLONG_MASK (0xffffffffffffffffULL)
#define MYCOMP_XYZ_CSR_ARG_CHAR_ARG_MASK (0xff)
/* Status/Control Masks */
#define MYCOMP_XYZ_CSR_BUSY_MASK (1<<0)
#define MYCOMP_XYZ_CSR_BUSY_OFFSET (0)
#define MYCOMP_XYZ_CSR_START_MASK (1<<0)
#define MYCOMP_XYZ_CSR_START_OFFSET (0)
#define MYCOMP_XYZ_CSR_INTERRUPT_ENABLE_MASK (1<<0)
#define MYCOMP_XYZ_CSR_INTERRUPT_ENABLE_OFFSET (0)
#define MYCOMP_XYZ_CSR_INTERRUPT_STATUS_MASK (1<<0)
#define MYCOMP_XYZ_CSR_INTERRUPT_STATUS_OFFSET (0)
#define MYCOMP_XYZ_CSR_DONE_MASK (1<<1)
#define MYCOMP_XYZ_CSR_DONE_OFFSET (1)
#endif /* __MYCOMP_XYZ_CSR_REGS_H__ */
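To show how a master might use these macros, the following sketch drives the mycomp_xyz register map through one call. It is a plain C++ illustration under several assumptions: MockCsrBus is a hypothetical stand-in for the master's bus accesses that imitates the component by copying struct_argument to returndata when start is written, call_mycomp_xyz is a hypothetical helper name, and the sequence of waiting for busy to clear, writing arguments, writing start, and polling done is one plausible reading of the memory map rather than a documented protocol. Only a subset of the generated macros is repeated, and the mylong and char_arg arguments are left out for brevity.

```cpp
#include <array>
#include <cstdint>

// Subset of the macros from the generated header for mycomp_xyz.
#define MYCOMP_XYZ_CSR_BUSY_REG (0x0)
#define MYCOMP_XYZ_CSR_START_REG (0x8)
#define MYCOMP_XYZ_CSR_INTERRUPT_STATUS_REG (0x18)
#define MYCOMP_XYZ_CSR_RETURNDATA_0_REG (0x20)
#define MYCOMP_XYZ_CSR_ARG_Y_REG (0x38)
#define MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_0_REG (0x40)
#define MYCOMP_XYZ_CSR_ARG_Y_MASK (0xffffffff)
#define MYCOMP_XYZ_CSR_BUSY_MASK (1<<0)
#define MYCOMP_XYZ_CSR_START_MASK (1<<0)
#define MYCOMP_XYZ_CSR_DONE_MASK (1<<1)

// Hypothetical software stand-in for the CSR slave as seen by a master.
// Writing 1 to start copies struct_argument to returndata and sets done,
// which imitates a component that returns struct_argument unchanged.
// The write-1-to-clear behavior of interrupt_status is not modeled.
struct MockCsrBus {
  std::array<std::uint64_t, 13> regs{}; // 0x0 through 0x60, 64 bits each

  std::uint64_t read(std::uint64_t byte_addr) const {
    return regs[byte_addr / 8];
  }

  void write(std::uint64_t byte_addr, std::uint64_t value) {
    if (byte_addr == MYCOMP_XYZ_CSR_START_REG) {
      if (value & MYCOMP_XYZ_CSR_START_MASK) {
        for (unsigned w = 0; w < 3; ++w) {
          regs[MYCOMP_XYZ_CSR_RETURNDATA_0_REG / 8 + w] =
              regs[MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_0_REG / 8 + w];
        }
        regs[MYCOMP_XYZ_CSR_INTERRUPT_STATUS_REG / 8] |=
            MYCOMP_XYZ_CSR_DONE_MASK;
      }
      return; // start is write-only and stores nothing
    }
    regs[byte_addr / 8] = value;
  }
};

// One call of the component through any bus type that offers read/write.
template <class Bus>
void call_mycomp_xyz(Bus &bus, std::uint32_t y,
                     const std::uint64_t (&arg)[3], std::uint64_t (&ret)[3]) {
  // Wait until the component can accept a new start.
  while (bus.read(MYCOMP_XYZ_CSR_BUSY_REG) & MYCOMP_XYZ_CSR_BUSY_MASK) {
  }
  // Write the arguments: y, then the three 64-bit words of struct_argument.
  bus.write(MYCOMP_XYZ_CSR_ARG_Y_REG, y & MYCOMP_XYZ_CSR_ARG_Y_MASK);
  for (unsigned w = 0; w < 3; ++w) {
    bus.write(MYCOMP_XYZ_CSR_ARG_STRUCT_ARGUMENT_0_REG + 8 * w, arg[w]);
  }
  // Start the component, then poll the done bit.
  bus.write(MYCOMP_XYZ_CSR_START_REG, MYCOMP_XYZ_CSR_START_MASK);
  while (!(bus.read(MYCOMP_XYZ_CSR_INTERRUPT_STATUS_REG) &
           MYCOMP_XYZ_CSR_DONE_MASK)) {
  }
  // Collect the three 64-bit words of return data.
  for (unsigned w = 0; w < 3; ++w) {
    ret[w] = bus.read(MYCOMP_XYZ_CSR_RETURNDATA_0_REG + 8 * w);
  }
}
```

On real hardware, the Bus type would wrap whatever memory-mapped access mechanism the master provides. The mock exists only so that the sequence can be exercised in software.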
4.5.2. Slave Memories
Passing a pointer argument through an Avalon® MM Master interface has two potential disadvantages:
- The master interface has a single port. If the component has multiple load-store sites, arbitration on that port might create stallable logic.
- Depending on the system in which the component is instantiated, other masters might use the memory bus while the component is running and create undesirable stalls on the bus.
Because a slave memory is internal to the component, the HLS compiler can create a memory architecture that is optimized for the access pattern of the component such as creating banked memories or coalescing memory accesses.
Slave memories differ from component memories because they can be accessed from an Avalon® MM Master outside of the component. Component memories are by definition restricted to the component and cannot be accessed outside the component.
You can explicitly control the structure of your slave memories by applying memory attributes to slave memory variable declarations.
A component can have many slave memory interfaces. Unlike slave register arguments that are grouped together in the CSR slave interface, each slave memory has a separate interface with separate data buses. The slave memory interface data bus width is determined by the width of the slave type. If the internal accesses to the memory have been coalesced, the slave memory interface data bus width might be a multiple of the width of the slave type.
You can apply component memory attributes to slave memories in your component to customize the memory architecture and lower the FPGA area utilization of your component. For details, refer to Component Memory Attributes.
component float myComponent(
    hls_avalon_slave_memory_argument(64) float *a,
    hls_avalon_slave_memory_argument(64) float *b) {
  return a[0] + b[0];
}
Reads and writes to slave memories from outside of the component should occur only when your component is not executing, unless you mark the slave memory argument with the volatile keyword. Without the volatile keyword, you might experience undefined component behavior if an external Avalon MM Master accesses your component slave memory when your HLS component is executing. The undefined behavior can occur even if the external access is to a memory address that your HLS component does not access.
Volatile Slave Memories
Add the volatile keyword to a slave memory argument to allow an Avalon MM Master to access the memory while the component executes. The compiler builds a memory system that ensures functional correctness even if the memory is modified from outside the component while the component is running.
However, allowing concurrent memory accesses with the volatile keyword might prevent some optimizations that the compiler can perform on your component. Consider this trade-off carefully in your design.
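A volatile slave memory argument might be declared along the lines of the following sketch. The component name sum_first_four and the buffer size are made up for illustration, and the fallback macro definitions are an assumption that only exists so that the sketch also builds with an ordinary C++ compiler when the "HLS/hls.h" header is not available.

```cpp
#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"
#define SKETCH_HAVE_HLS 1
#endif
#endif
#ifndef SKETCH_HAVE_HLS
// Fallback stubs (assumption): used only outside the HLS tool flow.
#define component
#define hls_avalon_slave_memory_argument(bytes)
#endif

// buf is a slave memory that holds 64 ints. The volatile qualifier tells the
// compiler that an external Avalon MM Master may access the memory while the
// component is running.
component int sum_first_four(
    hls_avalon_slave_memory_argument(64 * sizeof(int)) volatile int *buf) {
  int total = 0;
  for (int i = 0; i < 4; i++) {
    total += buf[i];
  }
  return total;
}
```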
Depending on how the slave memory is accessed from within the component, the compiler might infer arbitration logic between the external memory access and internal memory accesses. In this case, the external access always gets arbitration priority over internal accesses. This prioritization might prevent the data flow in the component from progressing when external requests arrive at every cycle.
You cannot verify the behavior of concurrent component memory access during simulation. To test the behavior, you must build a Verilog testbench to interact with your component. You can see an example of a Verilog testbench and how to use it in the following tutorial:
<quartus_installdir>/hls/examples/tutorials/interfaces/mm_slaves_CSR_volatile
Avalon® MM Master Access Control
- Use hls_readwrite_mode("readonly") to indicate that the external Avalon® MM Master interface only ever reads from the slave memory. If you specify this macro, no write ports from outside of the component are created.
- Use hls_readwrite_mode("writeonly") to indicate that the external Avalon® MM Master interface only ever writes to the slave memory. If you specify this macro, no read ports from outside of the component are created.
If you do not specify this macro, the compiler assumes that the external Avalon® MM Master interface can read or write to the slave memory.
You can use the macro to help the compiler create a more efficient memory system and potentially save FPGA area.
You can use the hls_readwrite_mode macro for both volatile and non-volatile slave memories.
component void foo(hls_avalon_slave_memory_argument(128*sizeof(int)) hls_readwrite_mode("writeonly") int *A);
Slave Memory Component Macros
Component Macro | Description |
---|---|
hls_avalon_slave_memory_argument | Implement the parameter, in on-chip memory blocks, which can be read from or written to over a dedicated slave interface. |
hls_readwrite_mode | Indicate to the compiler how the slave memory interface is accessed by external Avalon® memory-mapped (MM) masters. |
4.6. Stable Component Parameters
If the corresponding argument for a parameter does not change while your component executes, you can mark the parameter as stable. The arguments for a stable parameter can still change after all active component invocations have finished (that is, the component datapath is flushed). If an argument changes when there are active component invocations in progress, your component enters an undefined state and behaves unpredictably.
Declare an interface parameter to be stable with the hls_stable_argument attribute.
- Scalar (conduit) parameters
- Pointer interface parameters: The address conduit input is stable. The associated Avalon MM Master interface is not affected.
- Pass-by-reference parameters: The address conduit input is stable. The associated Avalon MM Master interface is not affected.
- Avalon® Memory-Mapped (MM) Master interface parameters: The address conduit input is stable. The associated Avalon MM Master interface is not affected.
- Avalon® Memory-Mapped (MM) Slave register interface parameters
- Avalon® Memory-Mapped (MM) Slave memory interface parameters
- Avalon® Streaming interface parameters
You might save some FPGA area in your component design when you declare an interface argument as stable because there is no need to carry the data with the pipeline.
Two component invocations cannot be in flight at the same time if they use different values for a stable argument.
Attribute | Description |
---|---|
hls_stable_argument | A stable parameter is a parameter that does not change while there is live data in the component (that is, the argument does not change between pipelined function invocations). |
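As a minimal sketch of the attribute in use, the following component marks a gain value as stable while the sample value changes with every invocation. The component and parameter names are hypothetical, and the fallback macro definitions are an assumption that lets the sketch build with an ordinary C++ compiler when the "HLS/hls.h" header is not available.

```cpp
#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"
#define SKETCH_HAVE_HLS 1
#endif
#endif
#ifndef SKETCH_HAVE_HLS
// Fallback stubs (assumption): used only outside the HLS tool flow.
#define component
#define hls_stable_argument
#endif

// gain must not change while any invocation is still in flight;
// sample may differ from one invocation to the next.
component int scale(hls_stable_argument int gain, int sample) {
  return gain * sample;
}
```

With this declaration, the caller is responsible for changing gain only after all outstanding invocations of scale have finished.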
4.7. Global Variables
Components can use and update C++ global variables. If you access a global variable in your component function, it is implemented as an Avalon® Memory-Mapped (MM) Master interface, like a pointer parameter.
If you access more than one global variable, each global variable uses the same Avalon® MM Master interface, which results in stallable arbitration. If you use pointers and non-constant global memory accesses, then the pointers and global memory accesses all share the same Avalon® MM Master interface.
In addition to the Avalon® MM Master interface, each global variable that the component uses has an input conduit that must be supplied with the address of the global variable in system memory. The input conduit arguments that are generated in the RTL are named @<global variable name>. Input conduits generated for pointer arguments omit the @ and are named for the corresponding pointer argument.
If your global variable is declared as const, then no Avalon® MM Master interface and no additional input conduit is generated. Therefore, global variables declared as const use significantly less FPGA area than modifiable global variables.
4.8. Structs in Component Interfaces
Review the interface_structs.sv file in your <a.prj>/components/<component_name> folder to see information about the padding and packed-ness of the implementation interfaces for the structs in your component.
The interface_structs.sv file contains the Verilog-style definitions of the structs found on your component interface.
4.9. Reset Behavior
For your HLS component, the reset assertion can be asynchronous but the reset deassertion must be synchronous.
reg [2:0] sync_resetn;
always @(posedge clock or negedge resetn) begin
  if (!resetn) begin
    sync_resetn <= 3'b0;
  end else begin
    sync_resetn <= {sync_resetn[1:0], 1'b1};
  end
end

wire synchronized_resetn;
assign synchronized_resetn = sync_resetn[2];
This synchronizer code is used in the example Intel® Quartus® Prime Pro Edition project that is generated for your components included in an i++ compile.
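To see how long the synchronized reset takes to deassert, it can help to step through the shift register by hand. The following behavioral model is only an illustration of the Verilog synchronizer above; the ResetSynchronizer type and its member names are hypothetical, and the model treats the asynchronous reset as if it were sampled on each call.

```cpp
#include <cstdint>

// Behavioral model of the 3-bit reset synchronizer (illustration only).
struct ResetSynchronizer {
  std::uint8_t sync_resetn = 0;  // models reg [2:0] sync_resetn

  // Models one rising clock edge. When resetn is low the register clears,
  // which is how the asynchronous assertion is approximated here.
  void clock_edge(bool resetn) {
    if (!resetn) {
      sync_resetn = 0;
    } else {
      // {sync_resetn[1:0], 1'b1}: shift left by one and shift in a 1.
      sync_resetn =
          static_cast<std::uint8_t>(((sync_resetn << 1) | 1u) & 0x7u);
    }
  }

  // Models assign synchronized_resetn = sync_resetn[2].
  bool synchronized_resetn() const { return ((sync_resetn >> 2) & 1u) != 0; }
};
```

In this model, synchronized_resetn() becomes true on the third clock edge after resetn goes high and returns to false as soon as resetn goes low.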
When the reset is asserted, the component holds its busy signal high and its done signal low. After the reset is deasserted, the component holds its busy signal high until the component is ready to accept the next invocation. All component interfaces (slaves, masters, and streams) are ready only after the component busy signal is low.
Simulation Component Reset
You can check the reset behavior of your component during simulation by using the ihc_hls_sim_reset API. This API returns 1 if the reset was exercised (that is, if the reset is called during hardware simulation of the component). Otherwise, the API returns 0.
int ihc_hls_sim_reset(void);
During x86 emulation of your component, the ihc_hls_sim_reset API always returns 0. You cannot reset a component during x86 emulation.
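A testbench might check the return value of the API along the lines of the following sketch. The helper name try_reset_component is hypothetical. The stand-in definition of ihc_hls_sim_reset is an assumption that mirrors the x86 emulation behavior described above and is used only when the "HLS/hls.h" header is not available.

```cpp
#include <cstdio>

#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"
#define SKETCH_HAVE_HLS 1
#endif
#endif
#ifndef SKETCH_HAVE_HLS
// Stand-in (assumption): behaves like x86 emulation and always returns 0.
inline int ihc_hls_sim_reset(void) { return 0; }
#endif

// Requests a component reset and reports whether it was exercised.
inline bool try_reset_component() {
  if (ihc_hls_sim_reset() == 1) {
    std::printf("Component reset was exercised in simulation.\n");
    return true;
  }
  std::printf("No reset performed (for example, during x86 emulation).\n");
  return false;
}
```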
5. Component Memories (Memory Attributes)
The Intel® High Level Synthesis (HLS) Compiler can build a hardware memory system using FPGA memory resources (such as block RAMs) for local, constant, and static variables as well as slave memory declared in your code.
In some cases, the Intel® HLS Compiler implements a local, constant, or static variable using registers in the component datapath. However, you can override that implementation by using memory attributes.
Memory accesses are mapped to load-store units (LSUs), which transact with the hardware memory through its ports. The Intel® HLS Compiler sometimes statically coalesces multiple memory accesses to a component memory into one wider memory access in order to save on the number of LSUs instantiated. LSUs for component memory are always pipelined LSUs.
If two or more LSUs need to be scheduled during the same cycle, the compiler might create stallable arbitration logic. Stallable arbitration logic appears red in the Component Viewer (in the High-Level Design Reports).
For more details about LSUs instantiated by the Intel® HLS Compiler, see Load-Store Unit Types. For details about coalescing memory accesses to save on instantiated LSUs, see Memory-Access Coalescing and Load-Store Units.
The following diagram shows a basic memory configuration:

The contents of a memory system can be partitioned into one or more memory banks, such that each bank contains a subset of data contained in the hardware memory:

A memory bank can contain one or more memory replicates. The compiler might create memory replicates to create more read ports. Having more read ports allows concurrent access to your memory system if you have many read operations.
The replicates in a memory bank contain identical data and you can read from the replicates simultaneously. A replicate can have two or four access ports, depending on whether the replicate is clocked at the same frequency (single pumped) or twice the frequency (double pumped) of the component. All ports in replicates can be accessed concurrently. The number of ports in a memory bank depends on the number of replicates that the bank contains.


The Intel® HLS Compiler can control the geometry and configuration parameters of the hardware memories that it builds. The compiler tries to create stall-free memory accesses. That is, the compiler tries to give memory reads and writes contention-free access to a memory port. A memory system is stall-free if all reads and writes in the memory system are contention-free.
The compiler tries to create a minimum-area stall-free memory system. If you want a different area-performance trade off, use the component memory attributes to specify your own memory system configuration and override the memory system inferred by the compiler.
Component Memory Attributes
Apply the component memory attributes to local, constant, and static variables in your component to customize the on-chip memory architecture of the component memory system and lower the FPGA area utilization of your component. You can also apply memory attributes to slave memories and struct data members.
These component memory attributes are defined in the "HLS/hls.h" header file, which you can include in your code.
Memory Attribute | Description |
---|---|
hls_force_pow2_depth | Specifies that the memory implementing the variable or array has power-of-2 depth. |
hls_register | Forces a variable or array to be carried through the pipeline in registers. A register variable can be implemented either exclusively in flip-flops (FFs) or in a mix of FFs and RAM-based FIFOs. |
hls_memory | Forces a variable or array to be implemented as embedded memory. |
hls_memory_impl | Forces a variable or array to be implemented as embedded memory of a specified type. |
hls_singlepump | Specifies that the memory implementing the variable or array must be clocked at the same rate as the component accessing the memory. |
hls_doublepump | Specifies that the memory implementing the variable or array must be clocked at twice the rate of the component accessing the memory. |
hls_numbanks | Specifies that the memory implementing the variable or array must have a defined number of memory banks. |
hls_bankwidth | Specifies that the memory implementing the variable or array must have memory banks of a defined width. |
hls_bankbits | Forces the memory system to split into a defined number of memory banks and defines the bits used to select a memory bank. |
hls_simple_dual_port_memory | Specifies that the memory implementing the variable or array should have no port that services both reads and writes. |
hls_merge (depthwise) | Allows merging two or more local variables to be implemented in component memory as a single merged memory system in a depth-wise manner. |
hls_merge (widthwise) | Allows merging two or more local variables to be implemented in component memory as a single merged memory system in a width-wise manner. |
hls_init_on_reset | Forces the static variables inside the component to be initialized when the component reset signal is asserted. |
hls_init_on_powerup | Sets the component memory implementing the static variable to initialize on power-up when the FPGA is programmed. |
hls_max_concurrency | Deprecated: This attribute is deprecated and will be removed in a future release. Use the hls_private_copies memory attribute instead. Specifies that the memory implementing the variable or array has a defined number of private copies to allow concurrent iterations of a loop at any given time. |
hls_max_replicates | Specifies that the memory implementing the variable or array has no more than the specified number of replicates to enable simultaneous reads from the datapath. |
hls_private_copies | Specifies that the memory implementing the variable or array has a defined number of private copies to allow concurrent iterations of a loop at any given time. |
Struct Datatypes and Memory Attributes
You can apply memory attributes to struct member variables in the struct declaration. If you also apply memory attributes to the object instantiation of a struct variable, the attributes on the instantiation override the attributes from the declaration.
struct State {
int array[100] hls_memory;
int reg[4] hls_register;
};
component int test(..) {
struct State S1;
struct State S2 hls_memory;
// some uses
}
For this example code, the compiler splits S1 into two variables, S1.array[100] (implemented in memory) and S1.reg[4] (implemented in registers). However, the compiler ignores the attributes applied at the struct declaration for object S2 because the S2 object has the hls_memory attribute applied at instantiation.
Constraints on Attributes for Memory Banks
The properties of memory banks constrain how you can divide component memory into banks with the memory bank attributes.
- The number of bytes in your array that you want to access at one time ($S$). If you are accessing a local variable, this value represents the size (in bytes) of the local variable.
- The number of memory banks specified by the hls_numbanks attribute ($N_{banks}$).
- The width (in bytes) of the memory banks specified by the hls_bankwidth attribute ($W$).
- The number of memory bank-select bits specified by the hls_bankbits attribute ($N_{bits}$). That is, $n+1$ when you specify $b_0, b_1, \ldots, b_n$ as the bank-select bits.
- $N_{banks} \times W = S$: The number of bytes accessed concurrently (or size of a local variable) is equal to the number of memory banks it uses times the width of the memory banks.
- $N_{banks}$ must be a power of 2 value.
- $N_{banks} = 2^{N_{bits}}$: $N_{bits}$ bank-selection bits are required to address $N_{banks}$ memory banks.
Values that you specify for the hls_numbanks, hls_bankwidth, and hls_bankbits attributes must meet these constraints. For attributes that you do not specify, the Intel® HLS Compiler infers values for the attributes following these constraints.
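One way to sanity-check a set of attribute values before you compile is to encode the three constraints in a small helper. The following sketch is not part of the compiler or its headers; the BankConfig type and the function names are hypothetical.

```cpp
// Hypothetical helper type: the attribute values chosen for one memory.
struct BankConfig {
  unsigned bytes_accessed;  // bytes accessed at one time
  unsigned numbanks;        // value given to hls_numbanks
  unsigned bankwidth;       // value given to hls_bankwidth, in bytes
  unsigned num_bankbits;    // how many bits are listed in hls_bankbits
};

inline bool is_power_of_two(unsigned v) {
  return v != 0 && (v & (v - 1)) == 0;
}

// Returns true when all three constraints hold:
//   number of banks times bank width equals the bytes accessed at one time,
//   the number of banks is a power of 2, and
//   the bank-select bits address exactly that many banks.
inline bool satisfies_bank_constraints(const BankConfig &c) {
  return is_power_of_two(c.numbanks) &&
         c.numbanks * c.bankwidth == c.bytes_accessed &&
         c.num_bankbits < 32 &&
         (1u << c.num_bankbits) == c.numbanks;
}
```

For example, accessing 32 bytes at a time with 8 banks that are 4 bytes wide and 3 bank-select bits passes the check, while using 2 bank-select bits with the same 8 banks fails it.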
5.1. Static Variables
The HLS compiler supports function-scope static variables with the same semantics as in C and C++.
Function-scope static variables are initialized to the specified values on reset. In addition, changes to these variables are visible across component invocations, making function-scope static variables ideal for storing state in a component. However, function-scope static variables cannot be shared by different task or component functions.
To initialize static variables, the component requires extra logic, and the component might take some time to exit the reset state while this logic is active.
Static Variable Initialization
Unlike a typical program, you can control when the static variables in your component are initialized if they are implemented as memories. The memory system that stores a static variable can be initialized either when your component is powered up or when your component is reset.
Initializing a static variable when a component is powered up resembles a traditional programming model where you cannot reinitialize the static variable value after the program starts to run.
Initializing a static variable when a component is reset initializes the static variable each time your component receives a reset signal, including on power up. However, this type of static variable initialization requires extra logic. This extra logic can affect the start-up latency and the FPGA area needed for your component.
- hls_init_on_reset
- The static variable value is initialized after the component is reset. Add this attribute to your static variable declaration as shown in the following example:
static char arr[128] hls_init_on_reset;
This is the default behavior for initializing static variables. You do not need to specify the hls_init_on_reset keyword with your static variable declaration to get this behavior.
For example, the static variable in the following example is initialized when the component is reset:
static int arr[64];
- hls_init_on_powerup
- The static variable is initialized only on power up. This initialization uses a memory initialization file (.mif) to initialize the memory, which reduces the resource utilization and start-up latency of the component.
Add this keyword to your static variable declaration as shown in the following example:
static char arr[128] hls_init_on_powerup;
Some static variables might not be able to take advantage of this initialization because of the complexity of the static variables (for example, an array of structs). In these cases, the compiler returns an error.
For a demonstration of initializing static variables, review the tutorial in <quartus_installdir>/hls/examples/tutorials/component_memories/static_var_init.
For information about resetting your component, see Reset Behavior.
6. Loops in Components
Loop Pipelining

There are some cases where pipelining is not possible at all. In other cases, a new iteration of the loop cannot start until N cycles after the previous iteration.
The number of cycles for which a loop iteration must wait before it can start is called the initiation interval (II) of the loop. This loop pipelining status is captured in the high level design report (report.html). In general, an II of 1 is desirable.
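As a back-of-the-envelope illustration of why a low II matters, the following sketch estimates when each iteration starts and how many cycles a pipelined loop takes. It assumes that every iteration has the same latency and that nothing stalls, which real designs do not guarantee. The function names are hypothetical, and the numbers reported in report.html remain the authoritative source.

```cpp
// Cycle at which iteration k (counting from 0) starts for a given II.
constexpr unsigned long iteration_start_cycle(unsigned long k,
                                              unsigned long ii) {
  return k * ii;
}

// Rough cycle count for n iterations: the last iteration starts at
// (n - 1) * II and then needs one iteration latency to finish.
constexpr unsigned long estimated_loop_cycles(unsigned long n,
                                              unsigned long ii,
                                              unsigned long latency) {
  return n == 0 ? 0 : (n - 1) * ii + latency;
}
```

Under these assumptions, 100 iterations with a 10-cycle iteration latency take about 109 cycles at II=1 and about 208 cycles at II=2.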
A common case where II > 1 is when a part of the loop depends in some way on the results of the previous iteration of the same loop. The circuit must wait for these loop-carried dependencies to be resolved before starting a new iteration of the loop. These loop-carried dependencies are indicated in the optimization report.
In the case of nested loops, II > 1 for an outer loop is not considered a significant performance limiter if a critical inner loop carries out the majority of the work. One common performance limiter is if the HLS compiler cannot statically compute the trip count of an inner loop (for example, a variable inner loop trip count). Without a known trip count, the compiler cannot pipeline the outer loop.
For more information about loop pipelining, see Pipeline Loops in Intel® High Level Synthesis Compiler Best Practices Guide.
Compiler Pragmas Controlling Loop Pipelining
The Intel® HLS Compiler has several pragmas that you can specify in your code to control how the compiler pipelines your loops.
Incorrect (produces a compile-time error) | Correct |
---|---|
#pragma ivdep TEST_LOOP: for(int idx = 0; idx < counter; idx++) {...} | TEST_LOOP: #pragma ivdep for(int idx = 0; idx < counter; idx++) {...} |
Pragma | Description |
---|---|
disable_loop_pipelining | Prevents the compiler from pipelining a loop. |
ii | Forces a loop to have a loop initiation interval (II) of a specified value. |
ivdep | Ignores memory dependencies between iterations of this loop. |
loop_coalesce | Tries to fuse all loops nested within this loop into a single loop. |
loop_fuse | Directs the compiler to try and fuse pairs of adjacent loops. |
max_concurrency | Limits the number of iterations of a loop that can simultaneously execute at any time. |
max_interleaving | Controls whether iterations of a pipelined inner loop in a loop nest from one invocation of the inner loop can be interleaved in the component data pipeline with iterations from other invocations of the inner loop. |
nofusion | Prevents the annotated loop from being fused with adjacent loops. |
speculated_iterations | Specifies the number of clock cycles that a loop exit condition can take to compute. |
unroll | Unrolls the loop completely or by a number of times. |
For a list of tutorials that demonstrate best practices to follow when implementing loops and using the loop pragmas in your component, see Loop Best Practices in Intel® High Level Synthesis Compiler Best Practices Guide.
6.1. Loop Initiation Interval (ii Pragma)
- The loop is not critical to the throughput of your component.
- The running time of the loop is small compared to other loops it might contain.
You can also apply the ii pragma to force a loop to an II of 1 and accept a possible fMAX penalty.
#pragma ii <desired_initiation_interval>
The <desired_initiation_interval> parameter is required and is an integer that specifies the number of clock cycles to wait between the beginning of execution of successive loop iterations.
6.2. Loop-Carried Dependencies (ivdep Pragma)
The ivdep pragma tells the compiler that a memory dependency between loop iterations can be ignored. Ignoring the dependency saves area and lowers the loop initiation interval (II) of the affected loop because the hardware required for avoiding data hazards is no longer required.
You can provide more information about loop dependencies by adding the safelen(N) clause to the ivdep pragma. The safelen(N) clause specifies the maximum number of consecutive loop iterations without loop-carried memory dependencies. For example, #pragma ivdep safelen(32) indicates to the compiler that there are a maximum of 32 iterations of the loop before loop-carried dependencies might be introduced. That is, while #pragma ivdep promises that there are no implicit memory dependencies between any iterations of this loop, #pragma ivdep safelen(32) promises that the iteration that is 32 iterations away is the closest iteration that could be dependent on this iteration.
- a component memory array
- a pointer argument
- a pointer variable that points to a component memory
- a reference to an mm_master object
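The safelen clause can be illustrated with a loop whose only loop-carried dependency has a distance of 32 iterations. The following sketch is an assumption-laden example rather than code from the compiler documentation: the function name add_lagged is hypothetical, and compilers other than i++ are expected to ignore the pragma (possibly with a warning), so the function still computes the same result there.

```cpp
#include <cstddef>

// Each element depends on the element written 32 iterations earlier, so the
// closest loop-carried dependency is 32 iterations away.
inline void add_lagged(int *a, std::size_t n) {
#pragma ivdep safelen(32)
  for (std::size_t i = 32; i < n; i++) {
    a[i] = a[i - 32] + 1;
  }
}
```

If the lag were smaller than the safelen value, the promise made by the pragma would be false and the generated hardware could produce incorrect results.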
Use Case 1:
If all accesses to memory arrays inside a loop do not cause loop-carried dependencies, add #pragma ivdep before the loop.
// no loop-carried dependencies for A and B array accesses
#pragma ivdep
for(int i = 0; i < N; i++) {
  A[i] = A[i + N];
  B[i] = B[i + N];
}
Use Case 2:
You may specify #pragma ivdep array (array_name) on particular memory arrays instead of all array accesses. This pragma is applicable to arrays, pointers, or pointer members of structs. If the specified array is a pointer, the ivdep pragma applies to all arrays that may alias with the specified pointer.
// No loop-carried dependencies for A array accesses
// Compiler inserts hardware that enforces dependency constraints for B
#pragma ivdep array(A)
for(int i = 0; i < N; i++) {
  A[i] = A[i - X[i]];
  B[i] = B[i - Y[i]];
}

// No loop-carried dependencies for array A inside struct
#pragma ivdep array(S.A)
for(int i = 0; i < N; i++) {
  S.A[i] = S.A[i - X[i]];
}

// No loop-carried dependencies for array A inside the struct pointed by S
#pragma ivdep array(S->X[2][3].A)
for(int i = 0; i < N; i++) {
  S->X[2][3].A[i] = S->X[2][3].A[i - X[i]];
}

// No loop-carried dependencies for A and B because ptr aliases
// with both arrays
int *ptr = select ? A : B;
#pragma ivdep array(ptr)
for(int i = 0; i < N; i++) {
  A[i] = A[i - X[i]];
  B[i] = B[i - Y[i]];
}

// No loop-carried dependencies for A because ptr only aliases with A
int *ptr = &A[10];
#pragma ivdep array(ptr)
for(int i = 0; i < N; i++) {
  A[i] = A[i - X[i]];
  B[i] = B[i - Y[i]];
}
6.3. Loop Coalescing (loop_coalesce Pragma)
Coalescing nested loops also reduces the latency of the component, which could further reduce your component area usage. However, in some cases, coalescing loops might lengthen the critical loop initiation interval path, so coalescing loops might not be suitable for all components.
#pragma loop_coalesce <loop_nesting_level>
The <loop_nesting_level> parameter is optional and is an integer that specifies how many nested loop levels you want the compiler to attempt to coalesce. If you do not specify the <loop_nesting_level> parameter, the compiler attempts to coalesce all of the nested loops.
for (A)
  for (B)
    for (C)
      for (D)
    for (E)
- Loop (A) has a loop nesting level of 1.
- Loop (B) has a loop nesting level of 2.
- Loop (C) has a loop nesting level of 3.
- Loop (D) has a loop nesting level of 4.
- Loop (E) has a loop nesting level of 3.
- If you specify #pragma loop_coalesce 1 on loop (A), the compiler does not attempt to coalesce any of the nested loops.
- If you specify #pragma loop_coalesce 2 on loop (A), the compiler attempts to coalesce loops (A) and (B).
- If you specify #pragma loop_coalesce 3 on loop (A), the compiler attempts to coalesce loops (A), (B), (C), and (E).
- If you specify #pragma loop_coalesce 4 on loop (A), the compiler attempts to coalesce all of the loops [loop (A) - loop (E)].
Example
The following simple example shows how the compiler coalesces two loops into a single loop.
#pragma loop_coalesce
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++)
    sum[i][j] += i+j;
int i = 0;
int j = 0;
while(i < N){
  sum[i][j] += i+j;
  j++;
  if (j == M){
    j = 0;
    i++;
  }
}
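To convince yourself that the coalesced form visits the same elements in the same order as the nested form, you can run both versions on the host. The following sketch is a plain C++ model, not compiler output; the Grid alias and the function names are hypothetical. The extra check for empty rows is an addition for host safety, because the single-loop form indexes a row before it tests j against M.

```cpp
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Nested form, as written in the source code.
inline void add_nested(Grid &sum) {
  const int n = static_cast<int>(sum.size());
  for (int i = 0; i < n; i++) {
    const int m = static_cast<int>(sum[i].size());
    for (int j = 0; j < m; j++) {
      sum[i][j] += i + j;
    }
  }
}

// Single-loop form that mirrors the coalesced loop shown above.
// Assumes a rectangular grid (every row has the same length).
inline void add_coalesced(Grid &sum) {
  const int n = static_cast<int>(sum.size());
  const int m = n > 0 ? static_cast<int>(sum[0].size()) : 0;
  if (m == 0) {
    return;  // nothing to do; avoids indexing an empty row
  }
  int i = 0;
  int j = 0;
  while (i < n) {
    sum[i][j] += i + j;
    j++;
    if (j == m) {
      j = 0;
      i++;
    }
  }
}
```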
6.4. Loop Unrolling (unroll Pragma)
Example code:
#pragma unroll <N>
for (int i = 0; i < M; ++i) {
  // Some useful work
}
In this example, <N> specifies the unroll factor, that is, the number of copies of the loop that the HLS compiler generates. If you do not specify an unroll factor, the HLS compiler unrolls the loop fully.
hls_register float data[N];
#pragma unroll 3
for (int i = 0; i < N; i++)
{
data[i] = function(i, a);
}
hls_register float data[N];
for (int i = 0; i < N; i += 3)
{
  data[i + 0] = function(i + 0, a);
  if (i + 1 < N) { data[i + 1] = function(i + 1, a); }
  if (i + 2 < N) { data[i + 2] = function(i + 2, a); }
}
You can find the unroll status of each loop in the high level design report (report.html).
6.5. Loop Concurrency (max_concurrency Pragma)
To achieve maximum concurrency in loops, sometimes private copies of component memory have to be created to break dependencies on the underlying hardware that prevent the loop from being fully pipelined.
- In the Details pane of the Loop Analysis report as a message that says that the maximum number of simultaneous executions has been limited to N.
- In the Bank view of your component memory in the Function Memory Viewer, where it graphically shows the number of private copies.
#pragma max_concurrency 1
for (int i = 0; i < N; i++) {
  int arr[M];
  // Doing work on arr
}
You can control the number of private copies created for a component memory accessed within a loop by using the hls_private_copies memory attribute. For details, see hls_private_copies Memory Attribute.
You can also control the concurrency of your component by using the hls_max_concurrency component attribute. For more information about the hls_max_concurrency(N) component attribute, see Concurrency Control (hls_max_concurrency Attribute).
6.6. Loop Iteration Speculation (speculated_iterations Pragma)
Typically, the exit condition for a loop iteration must be evaluated before the program determines whether to start the next loop iteration or continue into the rest of the function. This requirement means that the loop initiation interval (II) cannot be lower than the number of cycles required to compute the exit condition. Speculated iterations can help lower the loop II because operations within the loop can occur in the function pipeline at the same time as the exit condition is evaluated.
For any speculated iteration, instructions with side effects outside of the loop (like writing to memory or a stream) are not completed until the loop exit condition for the iteration has been evaluated. For loop iterations that are in flight but incomplete when the loop exit condition is met, side effect data is discarded.
The Intel® HLS Compiler determines the number of speculated iterations on a per-loop basis. You can see the number of speculated iterations for a loop in the Loop Analysis Report in the High Level Design Report (report.html).
While speculated iterations can improve loop II, they occupy the pipeline until they are completed. A new loop invocation cannot start until all of the speculated iterations have completed. For example, the next iteration of an outer loop cannot start until all the speculated iterations of an inner loop have completed.
For loops where the exit condition calculation is a bottleneck (as shown in the Loop Analysis Report), consider increasing the number of speculated iterations with the speculated_iterations pragma. Increasing the number of speculated iterations might not improve the loop II if other bottlenecks in the loop are found.
For frequently invoked loops with a low latency loop body (for example, an inner loop with a short trip count), you might want to use the speculated_iterations pragma to reduce the number of speculated iterations to reduce the overhead of your design. However, setting the number of speculated iterations too low might increase the loop II because there is not enough time to evaluate the exit condition.
The following example shows how you can change the characteristics of a pipelined loop with the speculated_iterations pragma.
#include <HLS/hls.h>

component void unopt_int_cube_root (int *dst, int N) {
  int m = 0;
  // The exit condition which has 2 multiplies and a compare is most critical
  // in loop feedback path. The compiler choice of 4 speculated iterations
  // results in II=2 because the exit condition takes 7 cycles: each
  // multiplication takes 3 cycles and the comparison takes 1 cycle. Four
  // speculated iterations times two-cycle II gives 8 cycles to cover this
  // evaluation.
  while (m*m*m < N) {
    m += 1;
  }
  dst[0] = m;
}

component void opt_int_cube_root (int *dst, int N) {
  int m = 0;
  // Increasing to 7 speculated iterations to cover the 7 cycle exit condition
  // calculation allows us to achieve II=1
  #pragma speculated_iterations 7
  while (m*m*m < N) {
    m += 1;
  }
  dst[0] = m;
}

component void unopt2_int_cube_root (int *dst, int N) {
  int m = 0;
  // By setting the pragma to 0, you can verify that the II has increased to 7,
  // which matches the exit condition bottleneck
  #pragma speculated_iterations 0
  while (m*m*m < N) {
    m += 1;
  }
  dst[0] = m;
}
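The II values quoted in the comments of this example are consistent with a simple rule of thumb: the number of speculated iterations multiplied by the II must cover the number of cycles needed to evaluate the exit condition. The following sketch encodes that rule. It is a simplification inferred from this one example, not the scheduling algorithm of the compiler, and the function name estimate_exit_condition_ii is hypothetical. Other bottlenecks in a loop can still raise the II.

```cpp
// Smallest II for which (speculated iterations) x II covers the exit
// condition latency. With zero speculated iterations, the whole exit
// condition must finish before the next iteration can start.
constexpr unsigned estimate_exit_condition_ii(unsigned exit_latency_cycles,
                                              unsigned speculated_iterations) {
  return speculated_iterations == 0
             ? exit_latency_cycles
             : (exit_latency_cycles + speculated_iterations - 1) /
                   speculated_iterations;  // ceiling division
}
```

For the 7-cycle exit condition in this example, the sketch gives an II of 2 with 4 speculated iterations, an II of 1 with 7, and an II of 7 with 0.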
The Loop Analysis Report for these components looks like the following example:
When you click the line with unopt2_int_cube_root.B2 (spec.cpp:31) in the Loop Analysis
Report, the Details pane shows the following information:
6.7. Loop Pipelining Control (disable_loop_pipelining Pragma)
When the loop iterations effectively execute sequentially due to loop-carried dependencies, use the disable_loop_pipelining pragma to generate a simple sequential datapath and avoid loop resource hardware duplication. The simpler datapath and lack of resource duplication in hardware reduces the FPGA area utilization of your component.
Use the Loop Analysis section of the high-level design reports (report.html) to help determine if you should apply this pragma to your loops.
#pragma disable_loop_pipelining
for (int i = 1; i < N; i++) {
  int j = a[i-1];
  // Memory dependency induces a high-latency loop feedback path
  a[i] = foo(j);
}
You can also disable pipelining the datapath of your entire component with the hls_disable_component_pipelining component attribute. For more information about this attribute, see Component Pipelining Control (hls_disable_component_pipelining Attribute).
6.8. Loop Interleaving Control (max_interleaving Pragma)
- Terminology Reminder
- A loop iteration is the single execution of a loop body. A loop invocation is the start of pipelined execution of loop iterations.
Interleaving Example 2
// Loop j is pipelined with ii=1
for (int j = 0; j < M; j++) {
  int a[N];
  // Loop i is pipelined with ii=2
  for (int i = 1; i < N; i++) {
    a[i] = foo(i);
  }
}
In this example, the inner loop i is pipelined with II=2. Under normal pipelining, this II means that the inner loop hardware only achieves 50% utilization, since one iteration of the i loop is initiated every other cycle. To take advantage of these idle cycles, the compiler interleaves a second invocation of the i loop from the next iteration of the outer j loop.
Because the i loop resides inside the j loop, and the j loop has a trip count of M, the i loop is invoked M times. The j loop is the outermost loop and is invoked once.
The following table shows the difference between normal pipelined execution of the i loop versus interleaved execution for this example for N=5.
Cycle | Pipelined Loop Iterations (j loop, i loop) | Interleaved Loop Iterations (j loop, i loop) |
---|---|---|
0 | (0,0) | (0,0) |
1 | --- | (1,0) |
2 | (0,1) | (0,1) |
3 | --- | (1,1) |
4 | (0,2) | (0,2) |
5 | --- | (1,2) |
6 | (0,3) | (0,3) |
7 | --- | (1,3) |
8 | (0,4) | (0,4) |
9 | --- | (1,4) |
10 | (1,0) | (2,0) |
11 | --- | (3,0) |
12 | (1,1) | (2,1) |
13 | --- | (3,1) |
14 | (1,2) | (2,2) |
15 | --- | (3,2) |
16 | (1,3) | (2,3) |
17 | --- | (3,3) |
18 | (1,4) | (2,4) |
19 | --- | (3,4) |
This table shows the values (j,i) for each inner loop iteration that is initiated at each cycle. At cycle 0, both modes of execution initiate the (0,0)th iteration of the i loop. Under normal pipelined execution, no i loop iteration is initiated at cycle 1. Under interleaved execution, the (1,0)th iteration of the innermost loop, i.e. the first iteration of the next (j=1) invocation of the i loop, is initiated. By cycle 10, interleaved execution has initiated all of the iterations of both the j=0 invocation of the i loop, and the j=1 invocation of the i loop. This represents twice the efficiency of the normal pipelined execution.
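
The schedule in the table can be reproduced with a few lines of arithmetic. The following sketch is a software model for illustration only; the function name initiated and its parameters are hypothetical, and the model assumes that the number of interleaved invocations is between 1 and the inner loop II, as in the table.

```cpp
#include <cassert>
#include <utility>

// Returns the (j, i) pair initiated at `cycle`, or (-1, -1) for an idle cycle.
//   n          : trip count of the inner i loop
//   ii         : initiation interval of the inner i loop
//   interleave : number of inner-loop invocations sharing the pipeline
//                (1 models normal pipelined execution)
// Assumes 1 <= interleave <= ii.
std::pair<int, int> initiated(int cycle, int n, int ii, int interleave) {
  const int group_cycles = n * ii;          // cycles to issue one group
  const int group = cycle / group_cycles;   // index of the current group
  const int within = cycle % group_cycles;  // cycle offset inside the group
  const int slot = within % ii;             // position inside one II window
  if (slot >= interleave) {
    return {-1, -1};                        // no iteration starts this cycle
  }
  return {group * interleave + slot, within / ii};
}
```

With n=5 and ii=2, calling initiated(cycle, 5, 2, 1) yields the Pipelined column and initiated(cycle, 5, 2, 2) yields the Interleaved column.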
Sometimes you might determine that this interleaving does not give you a performance benefit relative to the additional FPGA area needed to enable interleaving. In these cases, you can limit or restrict the amount of interleaving to reduce FPGA area utilization.
Using the max_interleaving Pragma
To limit the number of interleaved invocations of an inner loop that can be executed simultaneously, annotate the inner loop with the max_interleaving pragma. The annotated loop must be contained inside another pipelined loop.
The required parameter (n) specifies an upper bound on the degree of interleaving allowed. That is, how many invocations of the containing loop can execute the annotated loop at a given time.
-
#pragma max_interleaving 1
The compiler restricts the annotated (inner) loop to be invoked only once per outer loop iteration. That is, all iterations of the inner loop travel the pipeline before the next invocation of the inner loop can occur.
-
#pragma max_interleaving 0
The compiler allows the pipeline to contain a number of simultaneous invocations of the inner loop equal to the loop initiation interval (II) of the inner loop. For example, an inner loop with an II of 2 can have iterations from two invocations in the pipeline at a time.
This behavior is the default behavior for the compiler if you do not specify the max_interleaving pragma.
// Loop j is pipelined with ii=1
for (int j = 0; j < M; j++) {
  int a[N];
  // Loop i is pipelined with ii=2
  #pragma max_interleaving 1
  for (int i = 1; i < N; i++) {
    a[i] = foo(i);
  }
  …
}
6.9. Loop Fusion
Loop fusion is a compiler transformation in which two adjacent loops are merged into a single loop over the same index range. This transformation is typically applied to reduce loop overhead and improve run-time performance.
Unfused Loops | Fused Loop |
---|---|
for (j = 0; j < 300; j++) a[j] = a[j] + 3; for (k = 0; k < 300; k++) b[k] = b[k] + 4; | for (f = 0; f < 300; f++) { a[f] = a[f] + 3; b[f] = b[f] + 4; } |
Loop control structures represent a significant overhead. By fusing two loops, the number of control structures needed for the loops is reduced from two to one, reducing this overhead. The main goal of reducing the number of control structures is to save FPGA area for your design while still maintaining (ideally increasing) component throughput.
Fusing outer loops introduces parallelism where there was previously none. Combining the bodies of two adjacent loops (Lj and Lk) forms a single loop (Lf) with a loop body that spans the bodies of Lj and Lk. This combined loop body creates an opportunity for operations that were serialized across a given iteration of Lj and Lk to execute in parallel. In effect, the two loops now execute in lockstep as a single loop, which provides latency improvements.
If inner loops are fused, parallelism is already achieved by pipelined execution of the outer loop iteration. In these cases, the parallelism effect of loop fusion is diminished.
Fusion Criteria
- The loops must be adjacent.
That is, you cannot have a statement Si with side-effects such that Si executes after Lj and before Lk.
- Each loop must have a single-entry point and a single exit point.
For example, loops that contain break statements are not considered for fusion.
- The loops must have no negative-distance dependencies.
That is, for loops Lj and Lk where Lj is defined before Lk, iteration m of loop Lk does not depend on values calculated in iteration m+n (where n>0) of loop Lj.
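
To see why the negative-distance criterion matters, consider the following software-only sketch. The loop bodies and function names are hypothetical and chosen only for illustration: iteration m of Lk reads a[m+1], which iteration m+1 of Lj writes, so naively fusing the loops reads the value before it is produced.

```cpp
#include <array>
#include <cassert>

constexpr int kN = 8;

// Unfused: Lj runs to completion before Lk starts.
std::array<int, kN> unfused() {
  std::array<int, kN + 1> a{};  // zero-initialized; a[kN] is never written
  std::array<int, kN> b{};
  for (int m = 0; m < kN; ++m) a[m] = 10 + m;    // Lj
  for (int m = 0; m < kN; ++m) b[m] = a[m + 1];  // Lk reads Lj iteration m+1
  return b;
}

// Naively fused: both bodies share one iteration, which breaks the dependency.
std::array<int, kN> naively_fused() {
  std::array<int, kN + 1> a{};
  std::array<int, kN> b{};
  for (int m = 0; m < kN; ++m) {
    a[m] = 10 + m;
    b[m] = a[m + 1];  // a[m+1] has not been written yet, so this reads 0
  }
  return b;
}
```

Here unfused() and naively_fused() return different arrays, which is the kind of functional change that the fusion safety check is meant to prevent.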
Automatic Loop Fusion
The Intel® HLS Compiler fuses adjacent loops with equal trip counts automatically if the compiler analysis of your component determines that fusing the loops is profitable.
- One of the two loops, but not both, is annotated with the ivdep pragma.
- One of the two loops, but not both, contains stall-free logic.
The Loop Analysis Report in the High-Level Design Reports indicates when loops were fused.
-
nofusion pragma
Annotate loops with this pragma to request that the compiler not fuse the annotated loop.
-
loop_fuse pragma
Override the compiler profitability analysis and fuse adjacent loops provided that it is safe.
Use the loop_fuse pragma to tell the compiler to consider fusing adjacent loops with different trip counts.
6.9.1. Loop Fusion Control (loop_fuse Pragma)
Use the loop_fuse pragma to tell the compiler to try to fuse two adjacent loops without affecting the functionality of either loop, overriding the compiler profitability analysis of fusing the loops.
Fusing adjacent loops helps reduce the amount of loop control overhead in your component, which helps reduce the FPGA area used and can increase the performance by executing both loops as one (fused) loop.
#pragma loop_fuse [depth(N)] [independent] { ... }
#pragma loop_fuse // can also be
                  // #pragma loop_fuse depth(1)
{
  L1: for(...) {}
  L2: for(...) {
    L3: for(...) {}
    L4: for(...) {
      L5: for(...) {}
      L6: for(...) {}
    }
  }
}
By default (or depth(1)), only loops L1 and L2 are initially considered for fusing.
#pragma loop_fuse depth(2)
{
  L1: for(...) {}
  L2: for(...) {
    L3: for(...) {}
    L4: for(...) {
      L5: for(...) {}
      L6: for(...) {}
    }
  }
}
With depth(2), the following loop pairs are initially considered for fusing: L1 and L2, L3 and L4.
#pragma loop_fuse depth(3)
{
  L1: for(...) {}
  L2: for(...) {
    L3: for(...) {}
    L4: for(...) {
      L5: for(...) {}
      L6: for(...) {}
    }
  }
}
With depth(3), the following loop pairs are initially considered for fusing: L1 and L2, L3 and L4, L5 and L6.
The compiler automatically considers fusing adjacent loops with equal trip counts when the loops meet the criteria. You can also use the loop_fuse pragma to tell the compiler to consider fusing adjacent loops with different trip counts.
With the loop_fuse pragma applied to a block of code, the compiler always tries to fuse adjacent loops (with equal or different trip counts) in the block whenever the compiler determines that it is safe to fuse the loops. Two loops are considered safe to fuse if they meet the fusion criteria described in the Fusion Criteria section of Loop Fusion.
Unfused Loops | Fused Loop |
---|---|
#pragma loop_fuse { for (int i = 0; i < N; i++) { /* Loop Body 1 */ } for (int j = 0; j < M; j++) { /* Loop Body 2 */ } } | for (int f = 0; f < max(M,N); f++) { if (f < N) { /* Loop Body 1 */ } if (f < M) { /* Loop Body 2 */ } } |
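
The guarded form in the Fused Loop column can be checked in plain C++. The following sketch uses hypothetical loop bodies (adding 3 and 4 to two independent arrays) and hypothetical function names to show that the fused loop over max(M,N) iterations produces the same results as the two separate loops when the bodies are independent.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Result {
  std::vector<int> a;
  std::vector<int> b;
};

// Two separate loops with trip counts n and m.
Result run_unfused(int n, int m) {
  Result r{std::vector<int>(n, 1), std::vector<int>(m, 1)};
  for (int i = 0; i < n; ++i) r.a[i] += 3;  // Loop Body 1
  for (int j = 0; j < m; ++j) r.b[j] += 4;  // Loop Body 2
  return r;
}

// One loop over the larger trip count; guards keep each body in range.
Result run_fused(int n, int m) {
  Result r{std::vector<int>(n, 1), std::vector<int>(m, 1)};
  for (int f = 0; f < std::max(n, m); ++f) {
    if (f < n) r.a[f] += 3;  // Loop Body 1
    if (f < m) r.b[f] += 4;  // Loop Body 2
  }
  return r;
}
```

The equivalence holds only because the two bodies touch different data. Loops with dependencies between them must still satisfy the fusion criteria.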
#pragma loop_fuse
{
  L1: for(...) {}
  L2: for(...) {
    L3: for(...) {}
  }
  L4: for(...) {}
}
Use the independent option to override the dependency safety checks. If you specify the independent option, you are guaranteeing to the compiler that fusing pairs of loops affected by the loop_fuse pragma is safe. That is, there are no negative-distance dependencies in the fused loop. If it is not safe, you might get functional errors in your component.
Function Calls In loop_fuse Code Blocks
If a function call occurs in a code block annotated with the loop_fuse pragma and the inlined function contains a loop, the resulting loop can be a candidate for loop fusion.
Nested depth(N) Clauses
When you nest loop_fuse pragmas, you might create overlapping sets of candidate loops.
#pragma loop_fuse depth(2) independent
{
  L1: for(...) {}
  L2: for(...) {
    #pragma loop_fuse depth(2)
    {
      L3: for(...) {}
      L4: for(...) {
        L5: for(...) {}
        L6: for(...) {}
      }
    }
  }
}
In this example, the compiler considers the following loop pairs for fusion: L1/L2, L3/L4, and L5/L6. In addition, the compiler overrides the compiler negative-distance dependency analysis of the following loop pairs: L1/L2, L3/L4.
6.9.2. Loop Fusion Exemption (nofusion Pragma)
You can exempt a loop from being considered for fusing with an adjacent loop by annotating the loop with the nofusion pragma. This pragma prevents the annotated loop from being automatically fused or fused when it is subject to the loop_fuse pragma.
#pragma nofusion for (...) { loop body }
Applying the nofusion pragma to one of the loops in a pair prevents the loops from being fused.
#pragma nofusion L1: for (int j = 0; j < N; ++j) { data[j] += Q; } L2: for (int i = 0; i < N; ++i) { output[i] = Q * data[i]; } |
L1: for (int j = 0; j < N; ++j) { data[j] += Q; } #pragma nofusion L2: for (int i = 0; i < N; ++i) { output[i] = Q * data[i]; } |
7. Component Concurrency
The Intel® HLS Compiler provides you with the hls_max_concurrency component attribute to help you control the maximum concurrency of your component.
7.1. Serial Equivalence within a Memory Space or I/O
When visualizing a single shared memory space, think of multiple function calls as executing sequentially, one after another. This way, when the component asserts the done signal, the results of a component invocation in hardware are guaranteed to be visible to both the next component invocation and the external system.
The HLS compiler takes advantage of pipeline parallelism to execute component invocations and loop iterations in parallel if the associated dependencies allow for parallel execution. Because the HLS compiler generates hardware that keeps track of dependencies across component invocations, it can support pipeline parallelism while guaranteeing serial equivalence across memory spaces. Ordering between independent I/O instructions is not guaranteed.
7.2. Concurrency Control (hls_max_concurrency Attribute)
You can use the hls_max_concurrency component attribute to increase or limit the maximum concurrency of your component. The concurrency of a component is the number of invocations of the component that can be in progress at one time. By default, the Intel® HLS Compiler tries to maximize concurrency so that the component runs at peak throughput.
You can control the maximum concurrency of your component by adding the hls_max_concurrency component attribute immediately before you declare your component, as shown in the following example:
#include "HLS/hls.h"

hls_max_concurrency(3)
component void foo ( /* arguments */ ){
  // Component code
}
The optimizations caused by using this attribute might cause component memory configuration changes to meet the set concurrency requirements. Use memory attributes to control the geometry of your component memory configuration.
- You have a component memory system.
- At the component level, the Intel HLS compiler does not automatically create private copies of component memory to increase the throughput. If your component invocation uses a non-static component memory system, the next invocation cannot start until the previous invocation has finished all its accesses to and from that component memory.
This limitation is shown in the Loop Analysis report as load-store dependencies on the component memory.
Adding the hls_max_concurrency(N) attribute to the component creates private copies of the component memory so that you can have multiple pipelined invocations of your component in progress at the same time. To create as many private copies as necessary for maximal performance, use hls_max_concurrency(0).
For finer-grained control of which component memories to create private copies of, use the hls_private_copies memory attribute. For details, see hls_private_copies Memory Attribute.
- The compiler determines that reducing concurrency saves FPGA area.
- In some cases, the compiler reduces concurrency to save FPGA area. In these cases, the hls_max_concurrency(N) component attribute can increase the concurrency from 1.
The Loop Analysis report displays the concurrency for a function in the Details pane of the report when you click the function marked with (Component invocation) in the Loop Analysis pane. If your design concurrency is limited, the Details pane shows a line like the following line:
Maximum concurrent iterations: 1 is the default for component invocation loop.
The hls_max_concurrency attribute can also accept a value of 0. When this attribute is set to 0, the component should be able to accept new invocations as soon as the downstream datapath frees up. Use this value only when you see loop initiation interval (II) issues in your component because using this attribute can increase the component area. You can find loop II issues by examining the Loop Viewer in the High-Level Design Reports or looking for extra bubbles that are visible in a simulation waveform.
You can also control the concurrency of loops in components with the max_concurrency(N) pragma. For more information about the max_concurrency(N) pragma, see Loop Concurrency (max_concurrency Pragma).
7.3. Component Pipelining Control (hls_disable_component_pipelining Attribute)
If running concurrent invocations of your component does not improve throughput, or if you do not intend to invoke your component repeatedly, avoid extra FPGA area utilization by using the hls_disable_component_pipelining component attribute.
When you specify the hls_disable_component_pipelining attribute, the Intel® HLS Compiler generates a simpler, serialized datapath for your component.
#include "HLS/hls.h"

hls_disable_component_pipelining
component void baz ( /* arguments */ ){
  // component code
}
You can also disable pipelining the datapath of a loop in your component with the disable_loop_pipelining pragma. For more information about this pragma see Loop Pipelining Control (disable_loop_pipelining Pragma).
Review the Loop Analysis report in the High-Level Design Reports (report.html) to see component invocations and loops with pipelining disabled.
8. Arbitrary Precision Math Support
Some of these header files are based on the Algorithmic C (AC) data types that Mentor Graphics* provides under the Apache license. For more information about the Algorithmic C data types, refer to Mentor Graphics Algorithmic C (AC) Datatypes, which is available as a part of your Intel® HLS Compiler installation: <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf.
The Intel® HLS Compiler also supports arbitrary-precision IEEE 754 compliant floating-point data types that are not based on the AC data types.
Data Type | Intel Header File | Description |
---|---|---|
ac_int | HLS/ac_int.h | Arbitrary-width integer support. To learn more, review the ac_int tutorials provided with the compiler. |
ac_fixed | HLS/ac_fixed.h | Arbitrary-precision fixed-point number support. To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_fixed_constructor |
ac_fixed | HLS/ac_fixed_math.h | Support for some nonstandard math functions for arbitrary-precision fixed-point data types. To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_fixed_math_library |
ac_complex | HLS/ac_complex.h | Complex number support |
hls_float | HLS/hls_float.h | Arbitrary-precision floating-point number support |
hls_float | HLS/hls_float_math.h | Support for commonly used exponential, logarithmic, power, and trigonometric functions. To learn more, review the hls_float tutorials provided with the compiler. |
- ac_fixed data type: Include the HLS/ac_fixed_math.h header file
- hls_float data type: Include the HLS/hls_float_math.h header file
Advantages of Arbitrary Precision Data Types
The arbitrary precision data types have the following advantages over using standard C/C++ data types in your components:
- You can achieve narrower data paths and processing elements for various operations in the circuit.
- The data types ensure that all operations are carried out in a size guaranteed not to lose any data. However, you can still lose data if you store data into a location where the data type is too narrow.
Limitations of AC Data Types
The AC data types have the following limitations:
- Multipliers are limited to generating 512-bit results.
- Dividers for ac_int data types are limited to a maximum of 128 bit unsigned or 127 bit signed.
- Dividers for ac_fixed data types are limited to a maximum of 64 bits (unsigned or signed).
- The FPGA-optimized header files provided by the Intel® HLS Compiler are not compatible with GCC or MSVC. When you use the Intel® HLS Compiler header files, you cannot use GCC or MSVC to compile your testbench. Both your component and testbench must be compiled with the Intel® HLS Compiler.
To compile AC data types with GCC or MSVC, use the reference AC data types headers also provided with the Intel® HLS Compiler. For details, see AC Data Types and Native Compilers.
Limitations of the Intel® HLS Compiler Arbitrary Precision Floating Point Data Type
- Floating point optimization into constants performed for float and double data types is not done for the hls_float data type.
-
A limited set of math functions is supported. For details, see Operators and Return Types Supported by the hls_float Data Type.
- The hls_float header files provided by the Intel® HLS Compiler are not compatible with GCC or MSVC. When you use the Intel® HLS Compiler header files, you cannot use GCC or MSVC to compile your testbench. Both your component and testbench must be compiled with the Intel® HLS Compiler.
- The high-level design reports do not show bit widths for the hls_float data type.
-
Constant initialization works only with the round-towards-zero (RZERO) rounding mode.
8.1. Declaring ac_int Data Types
-
Include the ac_int.h header file in your component in the following manner:
#ifdef __INTELFPGA_COMPILER__
#include "HLS/ac_int.h"
#else
#include "ref/ac_int.h"
#endif
-
After you include the header file, declare your ac_int variables in one of the following ways:
- Template-based declaration
  - ac_int<N, true> var_name; //Signed N bit integer
  - ac_int<N, false> var_name; //Unsigned N bit integer
- Predefined types up to 63 bits
  - intN var_name; //Signed N bit integer
  - uintN var_name; //Unsigned N bit integer
Where N is the total length of the integer in bits.
Restriction: If you want to initialize an ac_int variable to a value larger than 64 bits, you must use the bit_fill or bit_fill_hex utility function. For details, see "2.3.14 Methods to Fill Bits" in Mentor Graphics Algorithmic C (AC) Datatypes, which is available as <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf.
The following code example shows the use of the bit_fill or bit_fill_hex utility functions:
typedef ac_int<80,false> i80_t;
i80_t x;
x.bit_fill_hex("a9876543210fedcba987"); // member function
x = ac::bit_fill_hex<i80_t>("a9876543210fedcba987"); // global function
int vec[] = { 0xa987, 0x6543210f, 0xedcba987 };
x.bit_fill(vec); // member function
x = bit_fill<i80_t>(vec); // global function
// inlining the constant array
x.bit_fill( (int [3]) { 0xa987,0x6543210f,0xedcba987 } ); // member function
x = bit_fill<i80_t>( (int [3]) { 0xa987,0x6543210f,0xedcba987 } ); // global function
For a list of supported operators and their return types, see "Chapter 2: Arbitrary-Length Bit-Accurate Integer and Fixed-Point Datatypes" in Mentor Graphics Algorithmic C (AC) Datatypes, which is available in the following file: <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf.
8.1.1. Important Usage Information on the ac_int Data Type
The ac_int datatype has a large number of API calls that are documented in Mentor Graphics Algorithmic C (AC) Datatypes, which is available in your Intel® HLS Compiler installation as the following file: <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf. For more information on AC datatypes, refer to
The ac_int datatype automatically increases the size of the result of the operation to guarantee that the intermediate operations never overflow. However, the HLS compiler automatically truncates or extends the result to the size of the specified destination container, so ensure that the storage variable for your computation is large enough.
The HLS compiler installation package includes a number of examples in the tutorials. Refer to the tutorials in <quartus_installdir>/hls/examples/tutorials/ac_datatypes for some of the recommended practices.
8.2. Integer Promotion and ac_int Data Types
The rules of integer promotion when you use ac_int data types are different from the standard C/C++ rules of integer promotion. Your component design should account for these differing rules.
- Both operands are standard integer types (int, short, long, unsigned char, or signed char):
If both operands are standard integer types (for example, char or short), the operands are promoted following the C/C++ standard. That is, the operation is carried out in the data type and size of the largest operand, but at least 32 bits. The expression returns the result in the larger data type.
- Both operands are ac_int data types:
If both operands are ac_int data types, operations are carried out in the smallest ac_int data type needed to contain all values. For example, the multiplication of two 8-bit ac_int values is carried out as a 16-bit operation. The expression returns the result in that type.
-
One operand is a standard integer type and one operand is an ac_int type:
If the expression has one standard data type and one ac_int type, the rules for ac_int data type promotion apply. The resulting expression type is always an ac_int data type. For example, if you add a short data type and an ac_int<16> data type, the resulting data type is ac_int<17>.
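
The result widths quoted in these rules can be worked out with simple arithmetic. The following sketch models them as constexpr functions. The formulas are an assumption based on the return-type rules in the AC datatypes reference, so confirm them against ac_datatypes_ref.pdf before relying on them; the function names are hypothetical and are not part of the ac_int API.

```cpp
#include <cassert>

constexpr int max_int(int a, int b) { return a > b ? a : b; }

// Width of ac_int<w1, s1> + ac_int<w2, s2>: an unsigned operand mixed with a
// signed one needs one extra bit, and the sum needs one more bit for carry.
constexpr int plus_width(int w1, bool s1, int w2, bool s2) {
  return max_int(w1 + ((s2 && !s1) ? 1 : 0), w2 + ((s1 && !s2) ? 1 : 0)) + 1;
}

// Width of ac_int<w1, ...> * ac_int<w2, ...>.
constexpr int mult_width(int w1, int w2) { return w1 + w2; }

// Width of ac_int<w1, ...> / ac_int<..., s2>: a signed divisor adds one bit.
constexpr int div_width(int w1, bool s2) { return w1 + (s2 ? 1 : 0); }

// short (16-bit signed) + ac_int<16> (signed) gives a 17-bit result.
static_assert(plus_width(16, true, 16, true) == 17, "short + ac_int<16>");
// Two 8-bit ac_int values multiply into a 16-bit result.
static_assert(mult_width(8, 8) == 16, "8-bit * 8-bit");
// A 16-bit product divided by a signed 8-bit value gives a 17-bit result.
static_assert(div_width(mult_width(8, 8), true) == 17, "int8 * int8 / int8");
```

The last assertion corresponds to an expression of the form a * b / c with signed 8-bit ac_int operands.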
Data Width of Operations
C++ compilers typically automatically promote narrow integer types such as char and short to 32-bit types (int) for arithmetic operations such as addition, multiplication, division, and bit-shifts. To adhere to the C++ language specification and be consistent with other C++ compilers, the Intel® HLS Compiler might use larger operations than you might expect when dealing with native types.
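
You can observe the native promotion behavior with any C++11 compiler, without HLS headers. The following sketch is a plain C++ check; the function name singlestep_native is hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Arithmetic on char or short operands produces an int result.
static_assert(std::is_same<decltype(char{1} * char{1}), int>::value,
              "char * char is evaluated as int");
static_assert(std::is_same<decltype(short{1} + short{1}), int>::value,
              "short + short is evaluated as int");

// Because a, b, and c are promoted to int, the intermediate product a * b
// is not truncated to 8 bits before the division.
int singlestep_native(char a, char b, char c) {
  return a * b / c;
}
```

For example, singlestep_native(100, 100, 3) returns 3333, even though the intermediate product 10000 does not fit in a char.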
If you need better control over the size of arithmetic operations, use the ac_int datatype.
component int singlestep_native_signed(char a, char b, char c) { return a * b / c; }
The Graph Viewer (part of the High-Level Design Reports) shows that the divide operation becomes a 32-bit operation.
#include <HLS/ac_int.h>

component int32 singlestep_acint_signed(int8 a, int8 b, int8 c) {
  return a * b / c;
}
With the component variables all defined as ac_int data types, the Graph Viewer now shows that the divide operation is reduced to being 17 bits wide.
Literals in Operations
ac_int<5, true> ap; ... if (ap < 4) { ...
If the operands are signed differently and the unsigned type is at least as large as the signed type, the operation is carried out as an unsigned operation. Otherwise, the unsigned operand is converted to a signed operand.
uint3 x = 7; if (x != -1) { // FAIL }
uint3 x = 7; if (x != (uint3)-1) { // SUCCEED }
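
The two comparisons above can be traced in plain C++. The following sketch is a stand-in model that does not use the ac_int headers: it represents the uint3 value as an int that holds 0 to 7, and the helper names are hypothetical. The literal -1 is a wider signed type, so by the rule above the unsigned operand is converted to a signed operand and 7 is compared with -1.

```cpp
#include <cassert>

// Stand-in for converting a value to a 3-bit unsigned integer:
// keep only the low 3 bits (modular conversion).
constexpr int as_uint3(int value) {
  return static_cast<int>(static_cast<unsigned>(value) & 0x7u);
}

// Models `x != -1` for a uint3 x: the comparison is carried out as signed.
constexpr bool differs_without_cast(int x) { return x != -1; }

// Models `x != (uint3)-1`: the literal is first wrapped to 3 bits (7).
constexpr bool differs_with_cast(int x) { return x != as_uint3(-1); }

static_assert(as_uint3(-1) == 7, "-1 wraps to 7 in 3 bits");
static_assert(differs_without_cast(7), "7 != -1 is true");
static_assert(!differs_with_cast(7), "7 != 7 is false");
```

Casting the literal to the ac_int type before comparing makes both operands the same width and signedness, which avoids the surprise.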
8.3. Debugging Your Use of the ac_int Data Type
When you use the DEBUG_AC_INT_WARNING and DEBUG_AC_INT_ERROR macros, you cannot declare constexpr ac_int variables or constexpr ac_int arrays.
Tool | Description |
---|---|
DEBUG_AC_INT_WARNING | Emits a warning for each detected overflow. |
DEBUG_AC_INT_ERROR | Emits a message for the first overflow that is detected and then exits the component with an error. |
Review the ac_int_overflow tutorial in <quartus_installdir>/hls/examples/tutorials/ac_datatypes to learn more.
8.4. Declaring ac_fixed Data Types
-
Include the ac_fixed.h header file in your component in the following manner:
#ifdef __INTELFPGA_COMPILER__
#include "HLS/ac_fixed.h"
#else
#include "ref/ac_fixed.h"
#endif
-
After you include the header file, declare your ac_fixed variables as follows:
- ac_fixed<N, I, true, Q, O> var_name; //Signed fixed-point number
- ac_fixed<N, I, false, Q, O> var_name; //Unsigned fixed-point number
Where the template attributes are defined as follows:
- N
- The total length of the fixed-point number in bits.
- I
- The number of bits used to represent the integer value of the fixed-point number.
The difference of N−I determines how many bits represent the fractional part of the fixed-point number.
- Q
- The quantization mode that determines how to handle values where the generated precision (number of decimal places) exceeds the number of bits available in the variable to represent the fractional part of the number.
For a list of quantization modes and their descriptions, see "2.1. Quantization and Overflow" in Mentor Graphics Algorithmic C (AC) Datatypes, which is available in the following file: <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf.
- O
- The overflow mode that determines how to handle values where the generated value has more bits than the number of bits available in the variable.
For a list of overflow modes and their descriptions, see "2.1. Quantization and Overflow" in Mentor Graphics Algorithmic C (AC) Datatypes, which is available in the following file: <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf.
For a list of supported operators and their return types, see "Chapter 2: Arbitrary-Length Bit-Accurate Integer and Fixed-Point Datatypes" in Mentor Graphics Algorithmic C (AC) Datatypes, which is available in the following file: <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf.
Additional math functions are supported by the HLS/ac_fixed_math.h header file. For details, see Math Functions Provided by the ac_fixed_math.h Header File.
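
To build intuition for the N and I parameters, the following plain C++ sketch models how a value is quantized when N−I fractional bits are available. It is an illustration only and does not use the ac_fixed headers. It assumes a quantization mode that truncates toward negative infinity, it ignores overflow handling, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Smallest representable step for a fixed-point format with n total bits
// and i integer bits: 2^-(n - i).
double resolution(int n, int i) {
  return std::ldexp(1.0, -(n - i));
}

// Quantizes x to n - i fractional bits by truncating toward negative
// infinity. Overflow of the integer part is not modeled.
double quantize_trn(double x, int n, int i) {
  const double scale = std::ldexp(1.0, n - i);  // 2^(n - i)
  return std::floor(x * scale) / scale;
}
```

For example, with N=8 and I=4 there are 4 fractional bits, so resolution(8, 4) is 0.0625 and quantize_trn(3.14159, 8, 4) is 3.125.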
8.5. Declaring ac_complex Data Types
-
Include the ac_complex.h header file in your component in the following manner:
#ifdef __INTELFPGA_COMPILER__
#include "HLS/ac_complex.h"
#else
#include "ref/ac_complex.h"
#endif
-
After you include the header file, declare your ac_complex variables according to the data type of your complex number.
The underlying data type can be ac_int, ac_fixed, hls_float, and standard C integer or floating-point data types.
For a list of supported operators and their return types, see "4. Complex Datatype" in Mentor Graphics Algorithmic C (AC) Datatypes, which is available in the following file: <quartus_installdir>/hls/include/ref/ac_datatypes_ref.pdf.
8.6. AC Data Types and Native Compilers
The reference version of the Mentor Graphics* Algorithmic C (AC) data types is also provided with the Intel® HLS Compiler. Do not use these reference header files in your component if you want to compile your component with an FPGA target.
Use the reference header files for AC data types to confirm functional correctness in your component when you are compiling your component with native compilers (g++ or MSVC).
If you use the reference header files and compile your component to an FPGA target, your component can compile successfully but your component QoR will be poor.
All of your code must use the same header files (either the reference header files or the FPGA-optimized header files). For example, your code cannot use the reference header files in your testbench and, at the same time, use the FPGA-optimized header file in your component code.
AC data type | Reference Header File | Description |
---|---|---|
ac_int | ref/ac_int.h | Arbitrary width integer support |
ac_fixed | ref/ac_fixed.h | Arbitrary precision fixed-point number support |
ac_complex | ref/ac_complex.h | Arbitrary precision complex number support |
8.7. Declaring hls_float Data Types
The Intel® HLS Compiler Pro Edition includes the hls_float.h header file for arbitrary-precision floating-point number support. The floating-point representation for hls_float data types adopts the same IEEE standard as native C++ float and double types.
The hls_float.h header file does not work with native compilers (g++ or MSVC).
An hls_float variable carries an explicit sign bit and an arbitrary number of bits for the exponent and mantissa.
Due to the differences in the internal math implementations and rounding errors, the results from hls_float operations might not always be bit-accurate to those produced by C++ native floating-point types with the same exponent and mantissa bit widths.
-
Include the hls_float.h header file in your component in the following manner:
#include "HLS/hls_float.h"
-
After you include the header file, declare your hls_float variables as follows:
hls_float<exponent_width, mantissa_width[,rounding_mode]>
The hls_float.h header file also provides aliases so you can declare bfloat16 and bfloat19 data types directly.
Where the template attributes are defined as follows:
- exponent_width, mantissa_width
- The bit-width of the exponent and mantissa of the floating-point variable.
The hls_float data type supports the following exponent_width, mantissa_width combinations:
Table 20. Exponent- and Mantissa-Width Combinations Supported by the hls_float Data Type
exponent_width | mantissa_width |
---|---|
5 | 10 |
8 | 7 |
8 | 10 |
8 | 17 |
8 | 23 |
8 | 26 |
10 | 35 |
11 | 44 |
11 | 52 |
15 | 63 |
Some of these width combinations map to some commonly used floating-point formats.
- rounding_mode
- Optional parameter to specify the IEEE 754 rounding mode used when converting between data types.
Set the rounding mode with one of the following values:
-
ihc::fp_config::FP_Round::RNE
Round to nearest, tie to even
This rounding mode is more accurate (0.5 ULP), but requires more FPGA area.
-
ihc::fp_config::FP_Round::RZERO
Round towards zero
This rounding mode is less accurate (1 ULP) and requires less FPGA area.
-
ihc::fp_config::FP_Round::RNE
The hls_float data type supports a limited set of math operations. For details, see Operators and Return Types Supported by the hls_float Data Type.
80-bit extended precision has one explicit bit of fraction that is dropped when converting it to hls_float<15,63>.
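The difference between the two rounding modes can be illustrated with a small software model. The following sketch is plain C++ and does not use hls_float.h. It narrows the bit pattern of a native float (8-bit exponent, 23-bit mantissa) to the 8, 7 layout. The names Round, narrow_to_8_7, and float_from_bits are hypothetical, NaN inputs are not modeled, and the code is not the implementation that the compiler generates.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the two rounding modes described above.
enum class Round { RNE, RZERO };

// Narrow a native float (1 sign, 8 exponent, 23 mantissa bits) to the
// 16-bit pattern of a value with an 8-bit exponent and a 7-bit mantissa.
// Assumption: the input is not NaN; NaN handling is not modeled here.
inline std::uint16_t narrow_to_8_7(float value, Round mode) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined bit copy
    if (mode == Round::RNE) {
        // Add just under half of the discarded range, plus the lowest kept
        // bit, so that exact ties round to the even neighbor.
        bits += 0x7FFFu + ((bits >> 16) & 1u);
    }
    // RZERO simply discards the 16 low-order mantissa bits.
    return static_cast<std::uint16_t>(bits >> 16);
}

// Build a float from a raw 32-bit pattern (helper for experiments).
inline float float_from_bits(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For the tie pattern 0x3F818000, this model yields 0x3F82 with RNE and 0x3F81 with RZERO, which is a difference of one unit in the last place of the narrower format.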
8.7.1. Operators and Return Types Supported by the hls_float Data Type
The hls_float data type supports all overloaded math operators and a limited set of the math functions provided by the Intel® HLS Compiler Pro Edition. For some math operators, you can control the precision of the output by using templated versions of the functions.
Supported Math Functions
- Exponential and logarithmic functions5:
  - ln, log2, log10, ln(1+x)
  - e^x, 2^x, 10^x, e^x − 1
- Power functions5:
  - pow, powr, pown
- Trigonometric functions5:
  - sin, cos, sincos
  - sinpi, cospi
  - asin, asinpi, acos, acospi, atan, atanpi, atan2
Conversion Rules
To convert between hls_float variables of different precisions, assign one variable to the other or call the convert_to() method:
hls_float<8, 32> myFloat = ...;
hls_float<3, 18> myFloat2 = myFloat; // use rounding rules defined by hls_float type
hls_float<3, 18> myFloat3 = myFloat.convert_to<3, 18, ihc::fp_config::FP_Round::RZERO>(); // use rounding rules defined in convert_to() function call
To convert between native types (for example, float, double) and hls_float data types, assign to or from the types. Type conversion in an assignment occurs according to the rules in the Default Conversion Rules for hls_float Variables table that follows.
For two hls_float variables in a binary operation, the hls_float variable with the larger exponent bit-width is considered to be the "larger" variable. If the two variables have the same exponent bit width, the variable with the larger mantissa bit-width is considered to be the larger variable. The operands are then unified to the "larger" type before the binary operation occurs.
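The ordering rule for binary operations can be written down as a short compile-time function. This is a sketch in plain C++ that models only the selection of the "larger" type. The FloatFormat structure and the unified_format function are hypothetical names and are not part of hls_float.h.

```cpp
// Hypothetical description of an hls_float format: exponent and mantissa widths.
struct FloatFormat {
    int exponent_width;
    int mantissa_width;
};

// Select the "larger" of two formats with the ordering described above:
// compare exponent widths first, and compare mantissa widths only when
// the exponent widths are equal.
constexpr FloatFormat unified_format(FloatFormat a, FloatFormat b) {
    return (a.exponent_width != b.exponent_width)
               ? ((a.exponent_width > b.exponent_width) ? a : b)
               : ((a.mantissa_width >= b.mantissa_width) ? a : b);
}

// An 8, 7 operand and a 5, 10 operand unify to 8, 7.
static_assert(unified_format(FloatFormat{8, 7}, FloatFormat{5, 10}).exponent_width == 8,
              "the wider exponent wins");
```

Under this rule, an 8, 7 operand combined with a 5, 10 operand unifies to 8, 7, so the operand with the wider mantissa loses mantissa bits before the operation. The operations with explicit precision controls are the way to avoid that.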
Native floating point data types and hls_float data types are converted to hls_float data types according to the rules in the Default Conversion Rules for hls_float Variables table that follows.
The Intel® HLS Compiler also provides some operations that leave the precision of input types untouched and provide control over the output precision. For details, see Operations With Explicit Precision Controls.
Data Type | From hls_float To Data Type | From Data Type To hls_float |
---|---|---|
hls_float with higher representable range | Keep exponent equivalent. The mantissa is rounded according to the rounding mode of the target hls_float (with the higher representable range). | ±Inf if the source of the conversion is out of the representable range. Otherwise, keep exponent equivalent. The mantissa is rounded according to the rounding mode of the target hls_float (with the smaller representable range). |
float | Convert original hls_float to hls_float<8, 23> with earlier hls_float rule, then bit-cast to float | Bit-cast float to hls_float<8, 23>, and then convert to target hls_float precision using the hls_float to hls_float rules described earlier. |
double | Convert original hls_float to hls_float<11, 52> with earlier hls_float rule, then bit-cast to double | Bit-cast double to hls_float<11, 52>, and then convert to target hls_float precision using the hls_float to hls_float rules described earlier. |
long double (emulation only) (Linux only) | Convert original hls_float to hls_float<15, 63> with earlier hls_float rule, then insert a 1-bit 1 to the MSB of fraction bits to get an approximate equivalent of 80-bit representation of long double | Drop the explicit 1 fraction bit to convert long double to 79-bit hls_float<15, 63> |
long double (emulation only) (Windows only) | Same as double | Same as double |
C++ native integer types | Truncate towards zero. Converting from hls_float that is larger than range of integer type is undefined behavior. | Round to nearest, tie breaks to even. If the integer value is too large, the hls_float value saturates to plus infinity. |
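The integer rows of the table can be explored with native types. Because the table equates float with hls_float<8, 23> through a bit-cast, the following sketch uses native float as a stand-in. The function names are hypothetical, and the sketch assumes a host that runs in the default IEEE 754 round-to-nearest mode. It does not exercise hls_float itself.

```cpp
#include <cstdint>

// Floating point to integer: truncate toward zero. The native cast follows
// the same rule, and is likewise undefined when the value does not fit.
inline std::int32_t to_int_truncating(float value) {
    return static_cast<std::int32_t>(value);
}

// Integer to floating point: round to nearest, ties to even.
// Assumption: the host uses the default IEEE 754 rounding mode.
inline float from_int_nearest_even(std::int32_t value) {
    return static_cast<float>(value);
}
```

With a 23-bit mantissa, integers above 2^24 are no longer all representable. In this model, 16777217 (2^24 + 1) lies halfway between two representable values and rounds to 16777216, while 16777219 rounds to 16777220, because the tie is broken toward the value with an even mantissa.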
Operations With Explicit Precision Controls
The Intel® HLS Compiler provides the following operations that leave the precision of input hls_float-type variables untouched and let you control the output precision:
- Syntax
- convert_to<output_exponent_width, output_mantissa_width, rounding_mode>
- Description
- Use this method to override the rounding mode set for an hls_float variable when you are converting the variable to a different precision.
By default, hls_float to hls_float conversions use the rounding mode that you specified when you declared the variable.
- Syntax
- ihc::hls_float<output_exponent_width, output_mantissa_width>::mul<[accuracy_setting], [subnormal_setting]>(hls_float_a, hls_float_b)
Where the optional parameters are defined as follows:
- subnormal_setting
- Optional parameter to specify whether input and output numbers are flushed to zero when carrying out basic binary operations explicitly.
Set this parameter with one of the following values:
- ihc::fp_config::FP_Subnormal::ON
Input and output numbers in the subnormal range are preserved.
The target FPGA device must have subnormal support. Subnormal support might require more FPGA area.
- ihc::fp_config::FP_Subnormal::OFF
Input or output numbers in the subnormal range are flushed to zero.
- ihc::fp_config::FP_Subnormal::AUTO
With this setting, the Intel® HLS Compiler enables subnormal support only when it is directly supported by the target FPGA device and it does not incur any extra FPGA area overhead.
- accuracy_setting
- Optional parameter that influences trade-offs between the accuracy of the result due to different rounding decisions in the intermediary calculations and the FPGA area utilized by the generated hardware. Floating-point operations with less accurate results typically use fewer logic elements.
For example, a divider with high accuracy might use 20% more FPGA area than a divider with low accuracy. The low accuracy divider has a higher error bound [1 unit of least precision (ULP)] than a high accuracy divider (0.5 ULP).
Set this parameter with one of the following values:
- ihc::fp_config::FP_Accuracy::LOW
- ihc::fp_config::FP_Accuracy::HIGH
- Description
- This math function supplements the basic multiplication operation performed by the multiplication (*) operator.
Multiplies hls_float_a and hls_float_b without changing the input types, and outputs an hls_float at the specified precision.
- Syntax
- ihc::hls_float<output_exponent_width, output_mantissa_width>::add<[optional parameters]>(hls_float_a, hls_float_b)
ihc::hls_float<output_exponent_width, output_mantissa_width>::sub<[optional parameters]>(hls_float_a, hls_float_b)
ihc::hls_float<output_exponent_width, output_mantissa_width>::div<[optional parameters]>(hls_float_a, hls_float_b)
- Description
- These math functions supplement the basic math operations performed by the addition, subtraction, and division (+, −, /) operators.
Adds, subtracts, or divides hls_float_a and hls_float_b by first casting hls_float_a and hls_float_b to the specified hls_float precision. The operation and output are at the specified precision.
You can also specify the optional parameters, which are the accuracy_setting and subnormal_setting parameters described earlier.
Comparison Operators
Comparison operators (>, <, ==, !=, >=, <=) are subject to the conversion rules described earlier.
The == and != operators impose a bit-wise comparison of the casted values.
Comparisons with NaN always return false.
Additional hls_float Functions
Function | Description |
---|---|
Getters and Setters | |
hls_float::get_exponent, hls_float::set_exponent | Gets/sets the exponent value of the hls_float variable. |
hls_float::get_mantissa, hls_float::set_mantissa | Gets/sets the mantissa value of the hls_float variable. |
hls_float::get_sign, hls_float::set_sign | Gets/sets the sign bit of the hls_float variable. |
Special Constants | |
hls_float<e,m>::nan() | Constant used to assign the hls_float variable a value of NaN. |
hls_float<e,m>::pos_inf() | Constant used to assign the hls_float variable a value of +∞. |
hls_float<e,m>::neg_inf() | Constant used to assign the hls_float variable a value of −∞. |
Value Queries | |
hls_float::is_nan() | Returns true if the value of the hls_float variable is NaN. |
hls_float::is_inf() | Returns true if the value of the hls_float variable is ±∞. |
hls_float::is_zero() | Returns true if the value of the hls_float variable is zero. |
Special Functions | |
hls_float::next_after(next_val) | Returns the next representable value towards next_val. |
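To make the sign, exponent, and mantissa fields concrete, the following sketch splits a native float into its raw fields. It is plain C++ and uses the IEEE 754 single-precision layout, which the conversion rules equate with hls_float<8, 23>. The FloatFields structure and the helper functions are hypothetical. Whether the hls_float getters return the raw, biased fields in exactly this form is an assumption.

```cpp
#include <cstdint>
#include <cstring>

// Raw fields of an IEEE 754 single-precision value.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

// Split a native float into its three raw bit fields.
inline FloatFields split_fields(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined bit copy
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Value queries expressed on the raw fields.
inline bool fields_are_nan(const FloatFields& f) {
    return f.exponent == 0xFFu && f.mantissa != 0u;
}
inline bool fields_are_inf(const FloatFields& f) {
    return f.exponent == 0xFFu && f.mantissa == 0u;
}
inline bool fields_are_zero(const FloatFields& f) {
    return f.exponent == 0u && f.mantissa == 0u;  // matches +0 and -0
}
```

For example, -0.15625 is -1.25 × 2^-3, so this model reports a sign of 1, a stored exponent of 124 (127 − 3), and a mantissa of 0x200000 (the fraction 0.25).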
8.7.2. Additional Data Types Provided By hls_float.h
The Intel® HLS Compiler hls_float.h header file provides aliases for some hls_float data types that you can use instead of explicitly declaring an hls_float data type.
The bfloat16 Data Type
The bfloat16 data type is a 16-bit floating point number with an 8-bit exponent and a 7-bit mantissa (equivalent to declaring hls_float<8, 7>).
On Intel® Agilex™ devices, dot product operations that involve the bfloat16 (or hls_float<8, 7>) data type are mapped to FP16 digital signal processing (DSP) blocks. On other device families, dot product operations are mapped to adaptive logic modules (ALMs) and fixed-point 18-bit DSPs.
On all device families, all other math functions are mapped to ALMs and fixed-point 18-bit DSPs.
The bfloat19 Data Type
The bfloat19 data type is a 19-bit floating point number with an 8-bit exponent and a 10-bit mantissa (equivalent to declaring hls_float<8, 10>).
On Intel® Agilex™ devices, dot product operations that involve the bfloat19 (or hls_float<8, 10>) data type are mapped to FP19 digital signal processing (DSP) blocks. On other device families, dot product operations are mapped to adaptive logic modules (ALMs) and fixed-point 18-bit DSPs.
On all device families, all other math functions are mapped to ALMs and fixed-point 18-bit DSPs.
9. Component Target Frequency
For details about the --clock option, see Command Options Affecting Compiling.
For details about the hls_scheduler_target_fmax_mhz component attribute, see hls_scheduler_target_fmax_mhz Component Attribute.
- The --clock option applies to all components compiled with the invocation of the i++ command that contains the --clock option.
- The hls_scheduler_target_fmax_mhz component attribute applies only to the component or task function that has the attribute.
To see these settings in use, review the following tutorial:
<quartus_installdir>/hls/examples/tutorials/best_practices/set_component_target_fmax
Consider the following two components, compiled with a --clock value of 300 MHz:
component int test1(){ … }
hls_scheduler_target_fmax_mhz(200) component int test2(){ … }
The compiler schedules component test1 at 300 MHz (from the command option) and component test2 at 200 MHz (from the component attribute).
- Important!
- Setting the target fMAX determines the pipelining effort at the compilation stage. Compiling with Quartus Prime software reports the achievable fMAX value for your components. This value is often different from the value you specified.
You can lower the --clock value to reduce the latency of your design at the expense of reducing the fMAX of your component.
9.1. Effects of Specifying Target II and Target fMAX
Setting a target II (through the hls_component_ii component attribute or the #pragma ii loop pragma) and setting a target fMAX (through the --clock command option or the hls_scheduler_target_fmax_mhz component attribute) affect how the scheduler in the Intel® HLS Compiler directs its efforts.
The following table summarizes the behavior of the scheduler in the Intel® HLS Compiler.
Set Target fMAX | Set Target II | Compiler Behavior |
---|---|---|
No | Yes | Best effort to achieve the II for the corresponding loop (may not achieve the best possible fMAX). |
Yes | No | Best effort to achieve fMAX specified (may not achieve the best possible II). |
Yes | Yes | Best effort to achieve the fMAX specified at the given II. The compiler errors out if it cannot achieve the requested II. |
No | No | Use heuristic to achieve best fMAX/II trade-off. |
If you set an fMAX target on the command line or with a component attribute, specify #pragma ii for performance-critical loops in your design.
10. Systems of Tasks
The component keyword marks a single function and its subfunctions as a component. Within this component function, directly-called functions are in-lined while functions that use the systems of tasks API calls (ihc::launch and ihc::collect) generate hardware outside the component datapath and behave like an asynchronous call.
The function tagged with the component keyword marks the boundary of a system of tasks. Your external system can interact with all the interfaces that the component exposes.
A system of tasks can help with the following goals:
- Improving the performance of operations like executing loops in parallel
- Reducing FPGA area utilization by sharing an expensive compute block with different parts of your component
Function | Description |
---|---|
ihc::launch | Marks a function as an Intel® HLS Compiler task for hardware generation, and launches the task function asynchronously. |
ihc::collect | Synchronizes the completion of the specified task function in the component. |
ihc::stream | Allows streaming communication between different task functions. |
ihc::launch_always_run | Launches a task function at component power-on or reset and continuously executes the function. Recommendation: Use the ihc_hls_set_component_wait_cycle function with this function to keep your component and always-run task functions correctly coordinated. |
Template Object or Parameter | Description |
---|---|
ihc::stream | Streaming interface to the component or task function. |
ihc::buffer | Specifies the capacity (in words) of the FIFO buffer on the input data that associates with the stream. |
ihc::usesPackets | Exposes the startofpacket and endofpacket sideband signals on the stream interface. |
Function API | Description |
---|---|
T read() | Blocking read call to be used from within the component or task. |
T read(bool& sop, bool& eop) | Available only if usesPackets<true> is set. Blocking read with out-of-band startofpacket and endofpacket signals. |
T tryRead(bool& success) | Non-blocking read call to be used from within the component or task. The success bool is set to true if the read was valid. |
T tryRead(bool& success, bool& sop, bool& eop) | Available only if usesPackets<true> is set. Non-blocking read with out-of-band startofpacket and endofpacket signals. |
void write(T data) | Blocking write call from the component or task. |
void write(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Blocking write with out-of-band startofpacket and endofpacket signals. |
bool tryWrite(T data) | Non-blocking write call from the component or task. The return value represents whether the write was successful. |
bool tryWrite(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Non-blocking write with out-of-band startofpacket and endofpacket signals. The return value represents whether the write was successful. |
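The non-blocking calls in the table can be easier to reason about with a software model. The following sketch is plain C++ and does not use the HLS headers. The ModelStream class is hypothetical and mirrors only the tryRead and tryWrite call shapes with a bounded FIFO. It says nothing about how the compiler implements ihc::stream in hardware.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical software model of a stream backed by a bounded FIFO buffer.
template <typename T>
class ModelStream {
public:
    explicit ModelStream(std::size_t capacity) : capacity_(capacity) {}

    // Non-blocking write: the return value reports whether the value
    // was accepted. A full FIFO rejects the write.
    bool tryWrite(T data) {
        if (fifo_.size() >= capacity_) {
            return false;
        }
        fifo_.push_back(data);
        return true;
    }

    // Non-blocking read: success reports whether the returned value is
    // valid. An empty FIFO returns a default-constructed placeholder.
    T tryRead(bool& success) {
        if (fifo_.empty()) {
            success = false;
            return T{};
        }
        T front = fifo_.front();
        fifo_.pop_front();
        success = true;
        return front;
    }

private:
    std::size_t capacity_;
    std::deque<T> fifo_;
};
```

In this model, a caller must check the tryWrite return value and the tryRead success flag before relying on the transfer, which is the same discipline that the non-blocking stream calls require.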
10.1. Task Functions
Task Function Interfaces
A task function can have the following types of interfaces:
- Task function arguments and return values
- Explicitly-declared Avalon® Streaming (ST) interfaces
- Explicitly-declared Avalon® Memory-Mapped (MM) Master interfaces
The following restrictions apply to these interfaces:
- You cannot use Avalon® MM Master interfaces (ihc::mm_master) defined at global or file scope in a component or its tasks.
- Pointer or reference data types cannot be passed to task functions as arguments unless they are a pointer or reference to an explicit Avalon® MM Master interface.
For an example, review the following tutorial:
<quartus_installdir>/hls/examples/tutorials/system_of_tasks/interfaces_sot
Scalar Parameters and Return Values
Like HLS components, the scalar parameters and return value for an HLS task are implemented as conduits and the hand-shaking is implemented as a simple stall/valid handshake. The ihc::launch and ihc::collect calls connect directly to the HLS task function do and return streams.
In the High Level Design Report (report.html), the ihc::launch and ihc::collect calls appear as blocking streaming write and streaming read operations.
Interaction with External Systems
Task functions can use a global instance of the ihc::stream_in class to take an input from the external system, or a global instance of the ihc::stream_out class to provide output to the external system.
The global ihc::stream_in and ihc::stream_out streams must be declared outside of any struct variables, and they cannot be declared in arrays.
Communication Between HLS Task Functions
For two task functions to communicate with each other, connect them with a global ihc::stream object (instead of the ihc::stream_in and ihc::stream_out objects).
The global ihc::stream object must be declared outside of any struct variables, and it cannot be declared in an array.
The ihc::stream object has an API very similar to the ihc::stream_in and ihc::stream_out classes. However, since these streams always require handshaking, the API does not support the parameters ihc::usesReady or ihc::usesValid. They do support tryRead and tryWrite API functions.
The ihc::stream objects can have both of their endpoints within the system of tasks. This includes within the same function as well. For an example of using an ihc::stream within a single function as a FIFO, see the following tutorial:
<quartus_installdir>/hls/examples/tutorials/system_of_tasks/internal_stream
If an instance of the ihc::stream class has only one endpoint within the system of tasks, it is treated as if it were an ihc::stream_in or ihc::stream_out class based on its usage within the system, so it can be used interchangeably with ihc::stream_in or ihc::stream_out (provided that the limitations do not affect the design). An ihc::stream object can be used for multiple tasks to communicate with one another. See the following tutorial:
<quartus_installdir>/hls/examples/tutorials/system_of_tasks/parallel_loops
HLS Task Function Restrictions
- Task functions cannot be shared between multiple components.
- All read sites and write sites for a stream must be within the same function (component or task).
- If you implement a class member function as a task function, the member function must be static. If you want to parameterize the function behavior, use function parameters or template parameters. You cannot use instance variables of an object.
- A task function can be launched (with ihc::launch) only from one component function or task function. The launching function and the collecting function can be different functions but they must be part of the same component system of tasks.
- A task function can be collected (with ihc::collect) only from one component or task function. The collecting function and the launching function can be different functions but they must be part of the same component system of tasks.
- No guarantee of execution order is provided between independent I/O instructions, even at the task level.
The ihc::launch and ihc::collect calls to a particular task function are executed in order.
Any stream accesses to that task from the current function are executed in instruction order only with respect to ihc::launch and ihc::collect calls to the corresponding function.
Task Attributes
You can apply the following function attributes to task functions:
- hls_max_concurrency
- hls_component_ii
- hls_scheduler_target_fmax_mhz
- hls_disable_component_pipelining
In addition to these function attributes, you can use any HLS attributes and pragmas within your HLS task functions. For example, you can use attributes and pragmas like #pragma ii, #pragma ivdep, hls_memory, and hls_register.
You cannot use component macros or component invocation interface control attributes when you define HLS task functions. For example, you cannot use hls_avalon_slave_register_argument, hls_conduit_argument, hls_stall_free_return, or hls_avalon_streaming_component.
10.2. Internal Streams
For an example of using the HLS tasks ihc::stream object as a FIFO, review the tutorial in <quartus_installdir>/hls/examples/tutorials/system_of_tasks/internal_stream.
This diagram is simplified from the tutorial. It shows 10 iterations, while the tutorial goes through 32 iterations.
In the diagram, i is the index of the outer loop and j is the index of the inner loop.

Each iteration of the outer loop reads all the values written by the previous loop iteration and writes one less value to the buffer. The internal stream outperforms the array in this design because the array must allocate enough space to store all written values before the values are read, but an internal stream does not need to allocate this space.
In addition, the trip count of the inner loop decreases by one in each outer loop iteration, so the space claimed by the array is never filled after the first iteration, which wastes area.
10.3. System of Tasks Simulation
When you simulate a system of tasks design where the completion of a task function is not synchronized with an ihc::collect call, use the ihc_hls_set_component_wait_cycle testbench API function to allow output from that task function to be returned after the component function finishes running.
By default, the simulation process simulates an additional 100 cycles after a component asserts the done signal to ensure all operations have propagated back to the testbench. This function tells the simulation process for the specified component to continue running for the specified number of cycles in addition to the default wait period of 100 cycles.
If you do not use this function in your testbench, the latency of some task functions might make your simulation output inaccurate or cause your simulation testbench to hang.
For an example of a valid system of tasks design where the completion of a task function is not synchronized with an ihc::collect call, see Example 3 of a Valid ihc::launch/ihc::collect Sequence.
Function | Description |
---|---|
ihc_hls_set_component_wait_cycle | This function tells the simulation process to continue running for a specified number of additional cycles (beyond the default wait period of 100 cycles) after the done signal for the specified component is observed. |
11. Libraries
With libraries, you can reuse functions without knowing the underlying hardware design or implementation details. Libraries can be created with Intel FPGA high-level design tools including the Intel® HLS Compiler and the Intel® FPGA SDK for OpenCL*, either from code initially targeting that tool or from RTL code.
Static-Object Libraries
A static-object library is a single platform-specific archive file that contains one or more object files. A static-object file contains implementations of one or more functions. The object and library files use the same formats as the operating system that you compile your Intel® HLS Compiler code on, with additional sections that carry HLS-specific information.
On Linux platforms, an object library is a .a archive file that contains .o object files. On Windows platforms, a library is a .lib archive file that contains .obj object files.
A static-object library includes one or more function signature files that you include in your component source code so that your component can call the functions provided by the library. A function signature file is a C-style header file (.h) that declares the signatures of the functions that are provided in an object library.
Static-object libraries can be created from RTL or high-level source code.
Source Code Libraries
A source code library is a C-style header file that contains the source code of the library functions. You include this header file in your component source code, and the header file code is compiled along with your component.
You can use C++ templates to make your source code library more customizable.
The Intel® HLS Compiler provides some source code libraries that provide you with FPGA-optimized code for some commonly-used algorithms.
11.1. Static-Object Libraries
A static-object library is a single platform-specific archive file that contains one or more object files, each of which contains implementations of one or more functions.
The object and library files use the same formats as the operating system that you compile your Intel® HLS Compiler code on, with additional sections that carry additional library information. On Linux platforms, a library is a .a archive file that contains .o object files. On Windows platforms, a library is a .lib archive file that contains .obj object files.
You can call the functions in the library from your component without needing to know the hardware design or the implementation details of the underlying functions in the library. Add the library to the i++ command line when you compile your component.
You can create static-object libraries for use with the following Intel high-level design products:
- Intel® HLS Compiler Pro Edition
- Intel® FPGA SDK for OpenCL™ Pro Edition
To create a library from your HLS code that targets the Intel® FPGA SDK for OpenCL™, you must have the Intel® FPGA SDK for OpenCL™ Pro Edition installed. The version of the SDK must be the same as your version of Intel® HLS Compiler.
- Each object file is generated from input source files with the fpga_crossgen command.
The required input source files depend on the type of source code you are creating the object from.
An object is effectively an intermediate representation of your source code with both a CPU representation and an FPGA representation of your code.
An object can be targeted for use with only one Intel® high-level design product. If you want to target more than one high-level design product, you must generate a separate object for each target product.
- Object files are collected into a library file with the fpga_libtool command.
Objects created from different types of source code can be collected into a library, provided all objects target the same high-level design product.
Libraries must be built and used with the same version of the Intel FPGA high-level design tools. For example, to compile your component with the Intel HLS Compiler Version 20.4, the libraries included in your component must have been created with a version 20.4 Intel FPGA high-level design tool.
11.2. Creating a Static-Object Library
Creating a static-object library is a multistep process where you create the objects that you want to include in a library and then collect the objects into a library file.
To create a static-object library:
- Create the objects for your library with the fpga_crossgen command. You can create your objects from a variety of sources:
  - Create an object from HLS code.
  For details, see Creating Objects From HLS Code.
  - Create an object from RTL code.
  For details, see Creating Objects From RTL Code.
  - Create an object from OpenCL code.
  For details, see Creating Library Objects From OpenCL Code in the Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide.
- Collect the objects into an object library with the fpga_libtool command.
For details, see Packaging Object Files Into a Library.
fpga_crossgen foo.cpp --target hls -o foo.o
fpga_crossgen bar.cl --target hls -o bar.o
fpga_libtool --target hls --create foobar.a foo.o bar.o
You can use the resulting library (foobar.a) in your component by including the header file or files that you created for the library (for example, foobar.h, or foo.h and bar.h) in your component source code and adding the library to the i++ command line:
i++ baz.cpp foobar.a
11.3. Creating Objects From HLS Code
You can create a library from object files from your HLS source code. An HLS-based object file contains code for CPU execution (testbench and emulation) and FPGA execution. A library can contain multiple object files.
You can create object files for use with different Intel high-level design tools from the same HLS source code.
Depending on the target high-level design tool, your source code might require adjustments to support tool-specific data types or constructs.
Intel® HLS Compiler
No additional work is needed in your HLS source code when you use the code to create objects for Intel® HLS Compiler libraries.
Intel® FPGA SDK for OpenCL™
The Intel® FPGA SDK for OpenCL™ supports language constructs that are not natively supported by C++. Your component might need modifications to support those constructs.
The Intel® HLS Compiler supports a limited set of OpenCL* language constructs through the ocl_types.h header file. For details, review Supported OpenCL Language Constructs.
To create an object from your HLS code that targets the Intel® FPGA SDK for OpenCL™ , you must have the Intel® FPGA SDK for OpenCL™ Pro Edition installed. The version of the SDK must be the same as your version of Intel® HLS Compiler.
11.3.1. Creating an Object File From HLS Code
Use the fpga_crossgen command to create objects for your library from your HLS code. An object created from HLS code contains information required both for emulating the functions in the object and synthesizing the hardware for the object functions.
The fpga_crossgen command creates one object file from one input source file. The object created can be used only in libraries that target the same Intel high-level design tool.
Objects are assigned the same version number as the version number of your Intel® HLS Compiler installation. Libraries can contain only objects with the same version number, and can only be used with Intel high-level design tools with the same version number.
extern "C" HLS_EXTERNAL int my_hls_func(int x);
fpga_crossgen <source_file> --target target_HLD_tool [-o <object_file_name>]
- target_HLD_tool
The target Intel® high-level design tool for this library. This parameter can have one of the following values:
  - hls
  Target this object to be included in libraries for components developed with the Intel® HLS Compiler.
  Objects built for the Intel® HLS Compiler are created as operating system specific object files (.o on Linux). You cannot use objects created on one operating system with the Intel® HLS Compiler running on a different operating system.
  - aoc
  Target this object to be included in libraries for kernels developed with the Intel® FPGA SDK for OpenCL™.
  Objects built for the Intel® FPGA SDK for OpenCL™ are not operating system specific. The objects are created as Intel® FPGA SDK for OpenCL™ object files (.aoco).
  You must have the Intel® FPGA SDK for OpenCL™ Pro Edition installed to use this option. The version of the SDK must be the same as your version of Intel® HLS Compiler.
If you do not specify an object file name with the -o option, the object file name defaults to be the same name as the source file name.
11.3.2. Supported OpenCL Language Constructs
If you are using the Intel® HLS Compiler to develop libraries to use with the Intel® FPGA SDK for OpenCL™ , you might need access to OpenCL* language constructs that are not typically available natively from C++ language elements. The Intel® HLS Compiler provides support for some OpenCL* language constructs through the ocl_types.h header file.
All basic signed and unsigned OpenCL* data types (double, float, long long, long, int, short, char and bool) are supported without needing the ocl_types.h header file.
#include "HLS/ocl_types.h"
- OpenCL* address space qualifiers
- Arbitrary precision integers (up to 64 bits)
- OpenCL* vector data types
OpenCL* Address Space Qualifiers
OpenCL* Address Space Qualifier | Intel® HLS Compiler Macro |
---|---|
__global | OCL_ADDRSP_GLOBAL |
__local | OCL_ADDRSP_LOCAL |
__constant | OCL_ADDRSP_CONSTANT |
__private | OCL_ADDRSP_PRIVATE |
Arbitrary Precision Integers
The ocl_types.h header file supports the OpenCL* intX_t and uintX_t data types up to 64 bits. However, these data types are in the ihc namespace to avoid conflicts with C-system header definitions.
That is, you can use ihc::int1_t through to ihc::int64_t and ihc::uint1_t through to ihc::uint64_t in your component.
Only use these data types to exchange data on your component interface (for example, parameters). Assign them to HLS ac_int<> data types in your component code.
11.4. Creating Objects From RTL Code
You can create a library from object files that package register transfer level (RTL) language source files. An RTL-based object file also contains an object manifest file (in XML format) that identifies the functions that are callable in the object file. A library can contain multiple RTL-based objects.
Creating a library from RTL code is a two-step process. First, each object file is created from the RTL source and emulation models as described in the object manifest file with the fpga_crossgen command. Then, one or more object files are collected into an HLS library file with the fpga_libtool command.
To create a library from RTL code, you need to create the following files and components:
File or Component | Description |
---|---|
RTL-based Functions | |
RTL module source files | Verilog (.v), System Verilog (.sv), or VHDL (.vhd) files and accompanying memory initialization files (.mif or .hex) that define the RTL modules in the library. You cannot use additional files such as Intel® Quartus® Prime IP File (.qip), Synopsys Design Constraints File (.sdc), or Tcl Script File (.tcl). |
Object manifest file | An XML (.xml) file that describes the properties of the callable functions available in the RTL module. The Intel® HLS Compiler uses these properties to integrate the RTL module in an HLS library into the component pipeline. |
RTL module function signature file | A C-style header file (.h) that declares the signatures of the functions that are implemented by the RTL module and described in the object manifest file. Use this header file in your HLS component source code so that your component can call the functions provided in the HLS library. |
HLS emulation model files | C++ files (.cpp and .h) that contain code that is functionally equivalent to the RTL component and has the same function signatures as the RTL component. The emulation model is used only for component emulation. Simulations use the RTL provided in the library. |
11.4.1. RTL Modules and the HLS Pipeline
HLS libraries allow you to use RTL modules that are written in Verilog, SystemVerilog, or VHDL inside HLS components. The Intel® HLS Compiler integrates the RTL modules into the HLS pipeline architecture.
Consider using HLS libraries in the following situations:
- You want to use optimized and verified RTL modules in HLS components without rewriting the modules as C++ functions.
- You want to implement HLS component functionality that you cannot express effectively in C++.
11.4.1.1. Integration of an RTL Module into the HLS Pipeline
The depicted RTL module has a latency of 3 cycles. Since the multiply and add operations have a latency of just one cycle, the compiler inserts buffering to balance the latency of the parallel data paths in the pipeline. A balanced latency allows the invocations of the HLS component to execute without stalling the pipeline.
Specifying the latency of the RTL module in the HLS library object manifest file allows the HLS compiler to balance the pipeline latencies in the HLS component. The pipeline integration protocol uses ready/valid handshaking, so the latency of the RTL module can be variable. However, the variability in the latency should be small to maximize performance. In addition, specify the latency in the HLS library object manifest file for the object in the HLS library so that the RTL module experiences a good approximation of the actual latency in steady state.
11.4.1.2. RTL Module Interfaces
For an RTL module to properly interact with other compiler-generated operations, you must support a simple ready/valid handshaking protocol at both the input and the output of an RTL module.
An RTL module must use a single streaming interface. That is, a single pair of ready and valid logic must control all the inputs.
You have the option to provide the necessary streaming ports but declare the RTL module as stall-free. In this case, you do not have to implement proper stall behavior because the Intel® HLS Compiler creates a wrapper for your module.
You must handle ivalid signals properly if your RTL module has an internal state. For more information, see Stall-Free RTL.
Consider the following interfaces for the RTL module myMod:

In this diagram, myMod interacts with the upstream module through data signals, arg1 and arg2, and control signals, ivalid (input) and oready (output). The ivalid control signal equals 1 (ivalid = 1) if and only if data signal arg1 and data signal arg2 contain valid data. When the control signal oready equals 1 (oready = 1), it indicates that the myMod RTL module can process the data signals arg1 and arg2 if they are valid (that is, ivalid = 1). When ivalid = 1 and oready = 0, the upstream module holds the values of ivalid, arg1, and arg2 in the next clock cycle.
The myMod module interacts with the downstream pipeline logic through the data signal result and the control signals, ovalid (output) and iready (input). The ovalid control signal equals 1 (ovalid = 1) if and only if the data signal result contains valid data. When the iready control signal equals 1 (iready = 1), the downstream module can process the data signal result if it is valid. When ovalid = 1 and iready = 0, the myMod RTL module must hold the values of the ovalid and result signals in the next clock cycle.
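The handshaking rule on either side of myMod can be modeled in a few lines of C++. The following is a minimal cycle-level sketch, assuming a producer that always has data available until its list is exhausted. The names transfers and run_link are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A data word crosses a ready/valid link in a cycle only when the producer
// asserts valid and the consumer asserts ready in that same cycle.
inline bool transfers(bool valid, bool ready) { return valid && ready; }

// Sends `items` across a link whose consumer ready signal is given per cycle
// by `ready_trace`. Returns the values the consumer accepted, in order.
std::vector<int> run_link(const std::vector<int>& items,
                          const std::vector<bool>& ready_trace) {
    std::vector<int> accepted;
    std::size_t next = 0;  // index of the word the producer currently presents
    for (bool ready : ready_trace) {
        const bool valid = next < items.size();
        if (transfers(valid, ready)) {
            accepted.push_back(items[next]);
            ++next;  // the producer may present a new word in the next cycle
        }
        // If valid && !ready, `next` is unchanged: the producer holds the
        // same valid and data values in the next cycle.
    }
    return accepted;
}
```

With a ready trace of 1, 0, 0, 1, 1, three words are accepted in cycles 0, 3, and 4, and the second word is held unchanged while ready is low.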
11.4.1.2.1. RTL Module Interface Signals
The Intel® HLS Compiler expects the RTL module to support a single interface with readyLatency = 0, at both input and output.
- ivalid and iready as the input ready/valid interface
- ovalid and oready as the output ready/valid interface

For an RTL module with a fixed latency, the output signals (ovalid and oready) can have constant high values, and the input ready signal (iready) can be ignored.
A stall-free RTL module might receive an invalid input signal (ivalid is low). In this case, the module ignores the input and produces invalid data on the output. For a stall-free RTL module without an internal state, it might be easier to propagate the invalid input through the module. However, for an RTL module with an internal state, you must handle an ivalid = 0 input carefully.
Example Timing Diagram of a Stall-free RTL Component
Consider the following example timing diagram of a stall-free RTL component:

- IS_STALL_FREE value = "yes"
- IS_FIXED_LATENCY value = "yes"
- EXPECTED_LATENCY value = "2"
Example Timing Diagram of a Non-stall-free RTL Component
Consider the following example timing diagram of a non-stall-free RTL component:

- IS_STALL_FREE value = "no"
- IS_FIXED_LATENCY value = "no"
- EXPECTED_LATENCY value = "4"
11.4.1.3. RTL Reset and Clock Signals
Because of the common clock and reset drivers, an RTL module runs in the same clock domain as the HLS component that is integrating the RTL module. The module reset input is asserted whenever the HLS component is reset.
11.4.1.3.1. Intel Agilex and Intel Stratix 10 Design-Specific Reset Requirements for Stall-Free and Stallable RTL Modules
Reset Requirements for Stall-Free RTL Modules
A stall-free RTL module is a fixed-latency module for which the Intel® HLS Compiler can optimize away stall logic.
- When creating a stall-free RTL module for an Intel® Agilex™ or Intel® Stratix® 10 design, use synchronous clear signals only.
- After deassertion of the reset signal to the stall-free RTL module, the module must be operational within 15 clock cycles. If the reset signal is pipelined within the module, this requirement limits the reset pipelining to no more than 15 stages.
Reset Requirements for Stallable RTL Modules
A stallable RTL module has a variable latency, and it relies on backpressured input and output interfaces to function correctly.
- When creating a stallable RTL module for an Intel® Agilex™ or Intel® Stratix® 10 design, use synchronous clear signals only.
- After assertion of the reset signal to the stallable RTL module, the module must deassert its oready and ovalid interface signals within 40 clock cycles.
- After deassertion of the reset signal to the stallable RTL module, the module must be fully operational within 40 clock cycles. The module signals its readiness by asserting the oready interface signal.
11.4.1.4. Object Manifest File Syntax
The HLS library object manifest file is an XML file that maps the RTL modules in a library object to functions that can be called by your HLS code. The Intel® HLS Compiler uses the properties defined in the manifest file to integrate an RTL module into the component pipeline.
The following example shows a simple object manifest file for an RTL module that implements a double-precision square root function. The RTL module is implemented in VHDL with a Verilog wrapper.
The following object manifest file is for an RTL module named my_fp_sqrt_double (line 2) that implements a callable function with a C interface named my_sqrtfd (line 2).
1: <RTL_SPEC>
2:   <FUNCTION name="my_sqrtfd" module="my_fp_sqrt_double">
3:     <ATTRIBUTES>
4:       <IS_STALL_FREE value="yes"/>
5:       <IS_FIXED_LATENCY value="yes"/>
6:       <EXPECTED_LATENCY value="31"/>
7:       <CAPACITY value="31"/>
8:       <HAS_SIDE_EFFECTS value="no"/>
9:       <ALLOW_MERGING value="no"/>
10:       <PARAMETER name="WIDTH" value="64"/>
11:     </ATTRIBUTES>
12:     <INTERFACE>
13:       <AVALON port="clock" type="clock"/>
14:       <AVALON port="resetn" type="resetn"/>
15:       <AVALON port="ivalid" type="ivalid"/>
16:       <AVALON port="iready" type="iready"/>
17:       <AVALON port="ovalid" type="ovalid"/>
18:       <AVALON port="oready" type="oready"/>
19:       <INPUT port="datain" width="64"/>
20:       <OUTPUT port="dataout" width="64"/>
21:     </INTERFACE>
22:     <REQUIREMENTS>
23:       <FILE name="my_fp_sqrt_double_s5.v" />
24:       <FILE name="fp_sqrt_double_s5.vhd" />
25:     </REQUIREMENTS>
26:     <RESOURCES>
27:       <ALUTS value="2057"/>
28:       <FFS value="3098"/>
29:       <RAMS value="15"/>
30:       <MLABS value="43"/>
31:       <DSPS value="1.5"/>
32:     </RESOURCES>
33:   </FUNCTION>
34: </RTL_SPEC>
XML Element | Description |
---|---|
RTL_SPEC | Top-level element in the object manifest file. There can only be one such top-level element in the file. |
FUNCTION |
Element that defines the HLS function that the RTL module implements. The name attribute within the FUNCTION element specifies the function name. You might have multiple FUNCTION elements, each declaring a different function that you can call from the HLS component. The same RTL module can implement multiple functions by specifying different parameters. To use the same module with different parameter combinations, create a separate FUNCTION tag for each parameter combination. |
ATTRIBUTES | Element that contains other XML elements that describe various characteristics (for example, latency) of the RTL module. The example RTL module takes one PARAMETER setting named WIDTH, which has a value of 64. See Table 29 for more details about other ATTRIBUTES-specific elements. If you create multiple RTL-based functions using different modules or use the same RTL module with different PARAMETER settings, you must create a separate FUNCTION element for each function. |
INTERFACE | Element that contains other XML elements that describe the RTL module interface. The example object manifest file shows the streaming interface signals that every RTL module must provide (that is, clock, resetn, ivalid, iready, ovalid, and oready). The resetn signal is active low. Its synchronicity depends on the target device:
The signal names must match the ones specified in the object manifest file. An error occurs during library creation if a signal name is different in the RTL code and the object manifest file. |
REQUIREMENTS | Element that specifies one or more RTL resource files (that is, .v, .sv, .vhd, .hex, and .mif). The specified paths to these files are relative to the location of the object manifest file. Each RTL resource file becomes part of the associated Platform Designer component that corresponds to the entire HLS component. HLS libraries do not support .qip files. |
RESOURCES | Optional element that specifies an estimate of the FPGA resources that the RTL module uses. If you do not specify this element, the estimated FPGA resources that the RTL module uses defaults to zero in the HLS resource estimation report. |
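To show how the example manifest relates to C++ code, the following sketch gives a function signature and an emulation model for my_sqrtfd. The function name comes from the manifest. The double argument and return types are assumptions chosen to match the 64-bit datain and dataout ports, and the use of std::sqrt as the reference behavior is also an assumption.

```cpp
#include <cassert>
#include <cmath>

// Function signature that an HLS component would include in order to call
// the RTL-based function. The name matches the FUNCTION name attribute in the
// example manifest; the types are assumed from the 64-bit port widths.
double my_sqrtfd(double x);

// Emulation model: functionally equivalent C++ that is used only when the
// component runs in emulation. Simulation uses the packaged RTL instead.
double my_sqrtfd(double x) {
    return std::sqrt(x);
}

// The single argument and the return value must each fit the 64-bit ports.
static_assert(sizeof(double) * 8 == 64, "double must map to a 64-bit signal");
```

In a real library, the declaration would typically live in the function signature (.h) file and the definition in the emulation model (.cpp) file.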
11.4.1.4.1. XML Elements for ATTRIBUTES
XML Element | Description |
---|---|
IS_STALL_FREE |
Instructs the Intel® HLS Compiler to remove all stall logic around the RTL module. Set IS_STALL_FREE to "yes" to indicate that the module does not generate stalls internally and it cannot properly handle incoming stalls. The module ignores the stall input. If you set IS_STALL_FREE to "no", the module must properly handle all stall and valid signals. If you set IS_STALL_FREE to "yes", you must also set IS_FIXED_LATENCY to "yes". Also, if the RTL module has an internal state, it must properly handle ivalid=0 inputs. |
IS_FIXED_LATENCY |
Indicates whether the RTL module has a fixed latency. Set IS_FIXED_LATENCY to "yes" if the RTL module always takes a known number of clock cycles to compute its output. The value you assign to the EXPECTED_LATENCY element specifies the number of clock cycles. The safe value for IS_FIXED_LATENCY is "no". When you set IS_FIXED_LATENCY="no", the EXPECTED_LATENCY value must be at least 1. For a given RTL module, you may set IS_FIXED_LATENCY to "yes" and IS_STALL_FREE to "no". Such a module produces its output in a fixed number of clock cycles and handles stall signals properly. |
EXPECTED_LATENCY |
Specifies the expected latency of the RTL module. If you set IS_FIXED_LATENCY to "yes", set the EXPECTED_LATENCY value to be the exact latency of the module. Otherwise, the Intel® HLS Compiler generates incorrect hardware. For a module with variable latency, the Intel® HLS Compiler balances the pipeline around this module to the EXPECTED_LATENCY value that you specify. For modules that can stall and require use of signals such as iready, the EXPECTED_LATENCY value must be set to at least 1. The specified value and the actual latency might differ for a module with variable latency, which might affect the number of stalls inside the pipeline. However, the resulting hardware is functionally correct. |
CAPACITY |
Specifies the number of multiple inputs that this module can process simultaneously. You must specify a value for CAPACITY if you also set IS_STALL_FREE="no" and IS_FIXED_LATENCY="no". Otherwise, you do not need to specify a value for CAPACITY. If CAPACITY is strictly less than EXPECTED_LATENCY, the Intel® HLS Compiler automatically inserts capacity-balancing FIFO buffers after this module when necessary. A conservative but safe value for CAPACITY is 1. |
HAS_SIDE_EFFECTS | Indicates whether the RTL module has side effects. Modules that have internal states or communicate with external memories are examples of modules with side effects. Set HAS_SIDE_EFFECTS to "yes" to indicate that the module has side effects. Setting HAS_SIDE_EFFECTS to "yes" ensures that optimization efforts do not remove calls to modules with side effects. Stall-free modules with side effects (that is, IS_STALL_FREE="yes" and HAS_SIDE_EFFECTS="yes") must properly handle ivalid=0 input cases because the module might receive invalid data occasionally. A conservative but safe value for HAS_SIDE_EFFECTS is "yes". This element, along with the ALLOW_MERGING element, allows the Intel® HLS Compiler to perform certain optimizations. For details, see Interaction Between ALLOW_MERGING and HAS_SIDE_EFFECTS Elements. |
ALLOW_MERGING | Indicates that the compiler can merge multiple instances of this RTL module. Set ALLOW_MERGING to "yes" to allow merging of multiple instances of the module. Intel® recommends setting ALLOW_MERGING to "yes". The safe value for ALLOW_MERGING is "no". Marking the module with HAS_SIDE_EFFECTS="yes" does not prevent merging. This element, along with the HAS_SIDE_EFFECTS element, allows the Intel® HLS Compiler to perform certain optimizations. For details, see Interaction Between ALLOW_MERGING and HAS_SIDE_EFFECTS Elements. |
PARAMETER |
Specifies the value of an RTL module parameter. PARAMETER attributes:
The value for an RTL module parameter can be specified using either a value or a type attribute. |
11.4.1.4.2. XML Elements for INTERFACE
In the object manifest file of the RTL module within an HLS library, there are XML elements under INTERFACE that define aspects of the RTL module interface.
The RTL module cannot access the memories of the HLS component.
XML Element | Description |
---|---|
INPUT |
Specifies the input parameter of the RTL module that receives the value of a call argument when the RTL-based function is called. INPUT attributes:
All call arguments must be passed by value. You cannot use reference, pointer, and array type arguments. |
OUTPUT |
Specifies the output parameter of the RTL module that represents the return value of functions based on this module. OUTPUT attributes:
The return value cannot be a pointer. |
STREAM | Specifies the stream
parameters to the RTL module.
STREAM attributes:
The values you specify here must match the values for the stream object input interface parameters in your component. For details about stream input interface parameters in your component, see Intel HLS Compiler Pro Edition Streaming Input Interfaces. The signal names in your RTL and your component code must align. For details, see Mapping HLS Data Types to RTL Signals. |
11.4.1.4.3. XML Elements for RESOURCES
XML Element | Description |
---|---|
ALUTS | Specifies the number of combinational adaptive look-up tables (ALUTs) that the module uses. |
FFS | Specifies the number of dedicated logic registers that the module uses. |
RAMS | Specifies the number of block RAMs that the module uses. |
DSPS | Specifies the number of digital signal processing (DSP) blocks that the module uses. |
MLABS | Specifies the number of memory logic arrays (MLABs) that the module uses. This value is equal to the number of adaptive logic modules (ALMs) that is used for memory divided by 10 because each MLAB consumes 10 ALMs. |
11.4.1.5. Mapping HLS Data Types to RTL Signals
All supported composite data types are represented by wide input or output signals. Typically, the components of a composite data type are presented with the first-declared value or value of lowest index in the low-order bits of the signal.
Streams
Stream objects are passed by value to the RTL function. The attributes specified for the STREAM element in the object manifest file should match the template arguments in your component code.
-
<port>_data
Signal for the data passed on the stream. This signal must be an input signal for direction="in" and an output signal for direction="out".
The width of this signal must match the width attribute in the STREAM element.
-
<port>_valid
Used only if usesValid="yes" is set in the STREAM element in the XML file.
This signal is a single-bit signal for the valid signal.
If direction="in", this signal is an input signal. If direction="out", this is an output signal.
-
<port>_ready
Used only if usesReady="yes" is set in the STREAM element in the XML file.
This signal is a single-bit signal for the ready signal.
If direction="in", this signal is an input signal. If direction="out", this is an output signal.
-
<port>_empty
Used only if usesEmpty="yes" is set in the STREAM element in the XML file.
This signal is a single-bit signal for the empty signal.
If direction="in", this signal is an input signal. If direction="out", this is an output signal.
-
<port>_startofpacket
Used only if usesPackets="yes" is set in the STREAM element in the XML file.
This signal is a single-bit signal to indicate the start of a packet.
If direction="in", this signal is an input signal. If direction="out", this is an output signal.
-
<port>_endofpacket
Used only if usesPackets="yes" is set in the STREAM element in the XML file.
This signal is a single-bit signal to indicate the end of a packet.
If direction="in", this signal is an input signal. If direction="out", this is an output signal.
These signal names align with stream object input interface parameter names. For details about stream input interface parameters in your component, see Intel HLS Compiler Pro Edition Streaming Input Interfaces.
Arrays
In C++, arrays are passed as a pointer to the memory in which the array is stored.
The Intel® HLS Compiler does not support pointer parameters for RTL modules. However, C++ allows you to pass a struct by value, so you can declare a struct data type that has an array as one of its members and declare your function to accept an argument of this struct-type by value.
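The struct-wrapping technique can be sketched as follows. The type name Vec4 and the function sum4 are hypothetical and serve only to illustrate passing an array inside a struct by value.

```cpp
#include <cassert>
#include <cstdint>

// Wrapping the array in a struct lets the whole array travel by value,
// instead of decaying to a pointer as a bare array parameter would.
struct Vec4 {
    std::int32_t v[4];
};

// Emulation-model-style function that receives all four elements by value.
// The callee works on its own copy; no pointer to caller memory is involved.
std::int32_t sum4(Vec4 in) {
    std::int32_t total = 0;
    for (int i = 0; i < 4; ++i) {
        total += in.v[i];
    }
    return total;
}
```

A caller would build a Vec4 (for example, Vec4 a{{1, 2, 3, 4}};) and pass it directly, as in sum4(a).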
Structs
-
Unpacked Structs
When your struct declaration is not packed, the layout of the input signal corresponding to the struct data type is determined by C language-specific padding rules that cause the Intel® HLS Compiler to insert padding bytes before struct members that require a specific alignment.
You should use packed structs as arguments to your RTL modules unless there is a specific reason to conform to a particular padded struct layout.
-
Packed Structs
If the struct type is declared as packed, member values start on an 8-bit boundary.
The Intel® HLS Compiler does not insert padding bytes to align struct members on platform-defined boundaries. The second-declared member always starts in the next highest byte after the high-order byte of the first-declared struct member.
-
System Verilog Structs
If you are developing an RTL module in System Verilog, you can declare a System Verilog struct type that corresponds to the C++ struct type that is mapped to the input signal of your RTL module.
The declaration order of the struct members is reversed in the System Verilog declaration because it specifies how the member signals should be concatenated to produce the composite signal. In a System Verilog concatenation expression, the bits are specified from high to low. That is, the last byte of the C++ struct type must be listed first in the System Verilog signal concatenation.
You can compile your emulation models as HLS components to obtain an interface_structs.v file that contains declarations of the System Verilog struct types corresponding to the struct-type arguments of those functions. For details, see the following tutorial:
<quartus_installdir>/hls/examples/tutorials/libraries/rtl_struct_mapping
-
Pointers in Structs
You cannot use struct types that have reference or pointer members as arguments to or return values from RTL-based functions.
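The packed-struct layout described above can be illustrated with standard C++. This sketch assumes that the compiler honors #pragma pack, as GCC, Clang, and MSVC do. The struct names and the to_signal helper are hypothetical, and the helper only models how the member bits would be arranged in the wide signal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)
struct PackedArgs {
    std::uint8_t flag;    // first-declared member: signal bits [7:0]
    std::uint32_t value;  // second-declared member: signal bits [39:8]
};
#pragma pack(pop)

// Same members without packing; the compiler may insert padding before value.
struct PaddedArgs {
    std::uint8_t flag;
    std::uint32_t value;
};

// Models the 40-bit composite signal for a PackedArgs argument, with the
// first-declared member in the low-order bits.
std::uint64_t to_signal(PackedArgs a) {
    return static_cast<std::uint64_t>(a.flag) |
           (static_cast<std::uint64_t>(a.value) << 8);
}
```

In a corresponding System Verilog struct declaration, value would be listed before flag, because concatenation lists bits from high to low.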
11.4.1.6. HLS Emulation Models for RTL-Based Functions
For an RTL-based function, write C++ code that serves as an emulation model for that function. This model is used when you run your component in emulation mode.
The emulation model is not used when you simulate your component; simulations use RTL extracted from the library.
If your function uses static variables to hold internal state, the emulation is equivalent to the RTL functionality only if the function is called from only one place in the HLS component.
This behavior is different because on CPUs all calls to the function share the same state variables. On FPGAs, the RTL module is instantiated once for each location in the HLS component where the function is called, and these instances do not share state.
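The following sketch illustrates the difference. All names are hypothetical. The template version is not something the compiler generates; it only mimics per-call-site state so that the two behaviors can be compared on a CPU.

```cpp
#include <cassert>

// Emulation model with internal state: on a CPU, every call site shares this
// single static variable.
int running_sum(int x) {
    static int total = 0;
    total += x;
    return total;
}

// Illustration only: a distinct template argument per call site gives each
// one its own static variable, which mimics one RTL instance (with private
// state) per call location.
template <int CallSite>
int running_sum_per_site(int x) {
    static int total = 0;
    total += x;
    return total;
}

// Hypothetical component body that calls the function from two places.
int component_emulated(int a, int b) {
    running_sum(a);         // call site 1
    return running_sum(b);  // call site 2 sees the state left by call site 1
}

int component_like_hardware(int a, int b) {
    running_sum_per_site<1>(a);         // instance 1
    return running_sum_per_site<2>(b);  // instance 2 starts from its own state
}
```

On a first invocation with arguments 1 and 2, the emulated version returns 3 while the per-instance version returns 2, which is the kind of mismatch that arises when a stateful function is called from more than one place.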
11.4.1.7. Potential Incompatibility between RTL Modules and Partial Reconfiguration
Consider a situation where you create and verify your library on a device that does not support Partial Reconfiguration (PR). If you then use the library RTL module inside a PR region, the module might not function correctly after PR. To avoid this problem, ensure that the following conditions are true:
- The RTL modules do not use memory logic array blocks (MLABs) with initialized content.
- The RTL modules do not make any assumptions regarding the power-up values of any logic.
For complete PR coding guidelines, refer to Creating a Partial Reconfiguration Design in the Partial Reconfiguration User Guide.
11.4.1.8. Stall-Free RTL
If you have an RTL module with a fixed latency that you want integrated into your component pipeline without surrounding stall logic, ensure that you set attributes in the object manifest file (.xml) as follows:
- Specify a value for the EXPECTED_LATENCY attribute (under the FUNCTION element) so that the latency equals the number of pipeline stages in the module. Important: An inaccurate EXPECTED_LATENCY value causes the RTL module to be out of sync with the rest of the pipeline, and can lead to functionally incorrect results.
- Set the IS_STALL_FREE attribute under the FUNCTION element to "yes".
This setting instructs the Intel® HLS Compiler to avoid placing stall logic around the RTL module. This setting also tells the compiler that the RTL module produces a result the number of cycles specified in the EXPECTED_LATENCY attribute after accepting input values. The stall-free logic produces a result every cycle, but the result is delayed by the number of cycles specified in the EXPECTED_LATENCY attribute.
For RTL modules with a fixed latency, the output signals (ovalid and oready) can have constant high values, and the input ready signal (iready) can be ignored.
A stall-free RTL module might receive an invalid input signal (ivalid is low). In this case, the module must produce invalid data on the output EXPECTED_LATENCY cycles after the cycle in which the input was invalid. For a stall-free RTL module without an internal state, you might find it convenient to propagate the invalid input through the module. If the module has an internal state, that state should not be affected by data inputs that are not accompanied by ivalid = 1.
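The behavior described above can be modeled at the cycle level in C++. The following is a minimal sketch of a stall-free, fixed-latency module with internal state (a running total). The class name StallFreeAccumulator and its interface are hypothetical, and the model makes no claim about how real RTL should be structured.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

struct Out {
    bool ovalid;
    int data;
};

// Cycle-level model of a stall-free accumulator with a fixed latency.
class StallFreeAccumulator {
public:
    explicit StallFreeAccumulator(std::size_t latency)
        : pipe_(latency, Out{false, 0}) {}

    // Advances one clock cycle and returns the output visible in that cycle.
    // An input presented in cycle t appears on the output in cycle t + latency.
    Out clock(bool ivalid, int datain) {
        if (ivalid) {
            total_ += datain;  // internal state changes only on valid inputs
        }
        pipe_.push_back(Out{ivalid, total_});  // invalid inputs propagate as invalid
        const Out out = pipe_.front();
        pipe_.pop_front();
        return out;
    }

private:
    std::deque<Out> pipe_;  // models the fixed-latency pipeline stages
    int total_ = 0;
};
```

With a latency of 2, a valid input of 5 in cycle 0 produces a valid output of 5 in cycle 2. An invalid input in cycle 1 yields an invalid output in cycle 3 and leaves the running total untouched.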
11.4.1.9. RTL Module Restrictions and Limitations for HLS Libraries
RTL Module Restrictions
When you create an RTL module, ensure that it operates within the following restrictions:
- The RTL module must work correctly at any clock frequency that passes timing analysis.
- Data input and output sizes must match the sizes of the arguments and return value declared in the RTL module function signature (.h) file.
For example, if you work with 24-bit values inside an RTL module, declare inputs to be 32 bits and declare the function signature to accept the uint data type. In the RTL module, accept the 32-bit input but discard the top 8 bits.
- RTL modules cannot connect to external I/O signals. All input and output signals must come from the HLS component that uses the library.
- An RTL module must have a clock port, a resetn port, and handshaking ports to support the data input and output interfaces. The handshake signals must be named ivalid, ovalid, iready, and oready.
- Every function call that corresponds to an RTL module instantiation is completely independent of other instantiations. No hardware is shared.
- An RTL module must receive all its inputs at the same time. A single ivalid input signifies that all inputs contain valid data.
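The 24-bit example in the restrictions above might look as follows in an emulation model. The function name increment24 and its behavior are hypothetical, and std::uint32_t stands in for the uint data type that the text mentions.

```cpp
#include <cassert>
#include <cstdint>

// Emulation-model-style function for a module that works on 24-bit values
// internally but declares a 32-bit argument and return value.
std::uint32_t increment24(std::uint32_t in) {
    const std::uint32_t mask24 = 0x00FFFFFFu;
    const std::uint32_t value24 = in & mask24;  // discard the top 8 bits
    return (value24 + 1u) & mask24;             // result wraps at 24 bits
}
```

The masking mirrors what the RTL module does when it accepts the 32-bit input and ignores the top 8 bits, so emulation and hardware agree even if the caller passes stray upper bits.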
RTL-Based Object Limitations
Using RTL modules in HLS libraries has the following limitations:
- You can set RTL module parameters only in the object manifest (.xml) file.
To use the same module with multiple parameter combinations, create a separate FUNCTION tag for each parameter combination.
- Pass data inputs to the RTL module only by value through the HLS component code.
You cannot pass streams, pointers, or references as input to an RTL module.
For streaming data, extract data from the stream first in your component and then pass the extracted scalar data to the RTL module in the HLS library.
Passing data inputs to an RTL module as pointers or references causes a fatal error in the Intel® HLS Compiler.
- Names of RTL module source files cannot conflict with the names of objects in other libraries or with the file names of Intel® HLS Compiler IP.
When you create a library, choose RTL module names that are unlikely to conflict with other libraries or compiler IP. For example, prefix the name of your RTL modules with the name of your library.
If there is a naming conflict, the Intel® Quartus® Prime compilation of the HLS component might fail or result in a functionally-incorrect FPGA image.
- Names of the RTL module and its signals cannot conflict with reserved names defined by any of the supported RTL languages: Verilog, System Verilog, and VHDL.
- The Intel® HLS Compiler does not support .qip files. You must manually parse nested .qip files to create a flat list of RTL files.
11.4.2. Creating a Static-Object File from an RTL Module
Before an RTL module can be included in a library intended for use in an Intel® HLS Compiler design, create a platform-specific object (.o files on Linux, .obj files on Windows) from the RTL module. Use the fpga_crossgen command to create the object.
For instructions on creating an OpenCL* library object file from RTL, see "Packaging an RTL Component for an OpenCL Library" in the Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide.
Before you can create an HLS library object from an RTL module, ensure that the functions in your RTL module are functionally correct and that you have the following files ready:
- RTL module source files
These files are the Verilog (.v), System Verilog (.sv), or VHDL (.vhdl) files and the accompanying memory initialization files (.mif or .hex) that define the RTL modules.
- RTL object manifest file
This XML file describes the callable interfaces of your RTL modules. Review Object Manifest File Syntax for details about what to include in this XML file.
- HLS emulation model file
These C++ files (.cpp and .h) provide an emulation model for the RTL module that allows you to emulate your component when it includes an HLS library that contains this RTL module. Full hardware compilations use the RTL source files.
- RTL module function signature file
This C-style header file (.h) declares the signatures of the functions that are implemented by the RTL module and described in the object manifest file. Include this file in your HLS component code for the component to call the functions provided by the RTL modules packaged in the object.
fpga_crossgen <object_manifest_file_name> --target hls --emulation_model <emulation_model_file_location> [-o <object_file_name>]
Where <object_manifest_file_name> is the path of the RTL object manifest (.xml) file, including the file name. The path can be absolute or relative.
If you do not specify an object file name with the -o option, the object file name defaults to be the same name as the object manifest file name. That is, an object manifest file named manifest.xml results in an object file named manifest.o (on Linux) or manifest.obj (on Windows).
The output of the command is a platform-specific object file (.o on Linux, .obj on Windows). The platform of the object file is determined by the platform where you run the fpga_crossgen command.
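The default naming rule for the object file can be expressed as a short helper. This is only an illustration of the rule stated above: the function default_object_name is hypothetical, is not part of fpga_crossgen, and handles a bare file name without directory components.

```cpp
#include <cassert>
#include <string>

// Returns the object file name that results when the -o option is omitted:
// the manifest file name with its extension replaced by .o (Linux) or
// .obj (Windows). Hypothetical helper for illustration only.
std::string default_object_name(const std::string& manifest_name,
                                bool on_windows) {
    // This sketch expects a file name only, with no directory separators.
    assert(manifest_name.find_first_of("/\\") == std::string::npos);
    const std::string::size_type dot = manifest_name.rfind('.');
    const std::string stem = (dot == std::string::npos)
                                 ? manifest_name
                                 : manifest_name.substr(0, dot);
    return stem + (on_windows ? ".obj" : ".o");
}
```

For a manifest named manifest.xml, the helper returns manifest.o for Linux and manifest.obj for Windows, matching the defaults described above.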
11.5. Packaging Object Files Into a Library
Collect object files in a library file so that others can incorporate the library into their projects and call the functions that are contained in the objects in the library. Package object files into a library with the fpga_libtool command.
Before you package object files into a library, ensure that you have the path information for all of the object files that you want to include in the library.
All objects to be packaged in the library must have the same version number. This library can be used only by an Intel® high-level design tool with the same version number.
The fpga_libtool command creates libraries encapsulated in operating system specific archive files (.a on Linux, .lib on Windows).
Create the HLS library file with the following command:
fpga_libtool --target target_HLD_tool --create library_name[.a | .lib | .aoclib] object_file_1 [object_file_2 ... object_file_n]
-
target_HLD_tool
The target Intel® high-level design tool for this library. This parameter can have one of the following values:
-
hls
Target this library for components developed with the Intel® HLS Compiler.
Libraries built for the Intel® HLS Compiler are encapsulated in operating system specific archive files (.a on Linux, .lib on Windows). You cannot use HLS libraries created on one operating system with the Intel® HLS Compiler running on a different operating system.
-
aoc
Target this library for kernels developed with the Intel® FPGA SDK for OpenCL™ .
Libraries built for the Intel® FPGA SDK for OpenCL™ are not operating system specific. The objects are created as Intel® FPGA SDK for OpenCL™ object files (.aoclib).
You must have the Intel® FPGA SDK for OpenCL™ Pro Edition installed to use this option. The version of the SDK must be the same as your version of Intel® HLS Compiler.
-
library_name
The name of the library file.
Specify the file extension of the library files as follows, depending on the target high-level design tool:
-
Intel® HLS Compiler
Specify the operating-system specific archive extension: .a for Linux-platform libraries and .lib for Windows-platform libraries.
-
Intel® FPGA SDK for OpenCL™
Specify .aoclib as the file extension for an OpenCL* library.
OpenCL* libraries are not operating-system specific.
You can specify one or more object files to include in the library.
fpga_libtool --create libdemo.a prim1.o prim2.o prim3.o --target hls
12. Advanced Hardware Synthesis Controls
12.1. The hls_fpga_reg() Function
In some cases, explicitly asking the compiler to insert a register stage between the operand and the return value of the function call can help improve the performance of your component. Use the hls_fpga_reg() function to insert at least one register between the operand and return value of the function call.
Typically, you do not need this function to achieve the performance that you want from your component. Situations where the hls_fpga_reg() function can help include the following:
- Breaking the critical paths between spatially distant portions of a data path, such as between processing elements of a large systolic array.
- Reducing the pressure on placement and routing efforts caused by spatially distinct portions of the kernel implementation.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/fpga_reg
- Syntax
-
T hls_fpga_reg(T op)
where T can be any sized type.
- Description
-
The hls_fpga_reg() function directs the Intel® HLS Compiler to insert at least one hardware pipelining register on the signal path that assigns the operand to the return value. This built-in function operates as an assignment, where the operand is assigned to the return value. The assignment has no implicit semantic or functional meaning beyond a standard assignment.
Functionally, you can consider the hls_fpga_reg() function to be always optimized away by the compiler.
- Usage Notes
- You can nest hls_fpga_reg() function calls to increase the minimum number of registers that are inserted on the assignment path. Because each function call guarantees the insertion of at least one register stage, the number of nested calls provides a lower limit on the number of registers.
For example, the following code snippet tells the compiler to insert at least two registers on the assignment path.
int out=hls_fpga_reg(hls_fpga_reg(in));
The compiler might insert more than two registers on the path.
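As a hedged illustration of the nesting rule, the following sketch wraps the result of an addition in two hls_fpga_reg() calls. The component name add_with_two_register_stages is hypothetical, and using HLS/hls.h as the header that makes hls_fpga_reg() available is an assumption. The #else branch contains stand-ins that are not part of the Intel® HLS Compiler; they exist only so that the sketch also builds with an ordinary C++ compiler that has no Intel® HLS headers.

```cpp
#ifdef __INTELFPGA_COMPILER__
#include "HLS/hls.h"  // Assumption: this header makes hls_fpga_reg() available.
#else
// Stand-ins (not part of the Intel HLS Compiler) so that the sketch builds
// with an ordinary C++ compiler. Functionally, hls_fpga_reg() behaves as a
// plain assignment of the operand to the return value.
#define component
template <typename T>
T hls_fpga_reg(T op) { return op; }
#endif

// Hypothetical component: requests at least two register stages on the
// path that carries the sum to the return value.
component int add_with_two_register_stages(int a, int b) {
  int sum = a + b;
  return hls_fpga_reg(hls_fpga_reg(sum));
}
```

The nesting changes only the minimum number of pipeline registers. The value returned is the same as for the unregistered expression a + b.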
13. Intel High Level Synthesis Compiler Pro Edition Reference Summary
13.1. Intel HLS Compiler Pro Edition i++ Command-Line Arguments
Use the i++ command-line arguments to affect how your component is compiled and linked.
General i++ Command Options
Option | Description |
---|---|
--debug-log | Generate the compiler diagnostics log. |
-h, --help | List compiler command options along with brief descriptions. |
-o result | Place compiler output into the <result> executable and the <result>.prj directory. |
-v | Display messages describing the progress of the compilation. |
--version | Display compiler version information. |
Command Options Affecting Compiling
Option | Default Value | Description |
---|---|---|
-c | | Preprocess, parse, and generate object files. |
--component component name | | Comma-separated list of function names to compile to RTL. To use this option, your component must be configured with C-linkage using the extern "C" specification. For example: extern "C" int myComponent(int a, int b). Using the component function attribute is preferred over using the --component command option to indicate functions that you want the compiler to compile to RTL. |
-D macro [= val ] | | Define a <macro> with <val> as its value. |
-g | | Generate debug information (default option). |
-g0 | | Do not generate debug information. |
--gcc-toolchain=<GCC_dir> | | Specifies the path to a GCC installation that you want to use for compilation. This path should be the absolute path to the directory that contains the GCC lib, bin, and include folders. |
--hyper-optimized-handshaking=[auto|off] | auto | This option applies to Intel® Agilex™ and Intel® Stratix® 10 devices only. Use this option to modify the handshaking protocol used in certain areas of your design. |
-I dir | | Add directory <dir> to the end of the main include path. |
-march=[x86-64 | FPGA_family | FPGA_part_number] | x86-64 | Generate code for an emulator flow (x86-64) or for the specified FPGA family or FPGA part number. |
--quartus-compile | | Run the HDL generated through Intel® Quartus® Prime to generate accurate fMAX and area estimates. Your component is not expected to cleanly close timing. |
--quartus-seed <seed> | | Specifies the Fitter seed to use when your component is compiled to hardware by Intel® Quartus® Prime. |
--simulator simulator_name | modelsim | Specifies the simulator you are using to perform verification, given as <simulator_name>. If you do not specify this option, --simulator modelsim is assumed. |
-ffp-contract=fast | | Remove intermediate rounding and conversion when possible, except for code blocks fenced by #pragma clang fp contract(off). To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops |
-ffp-reassoc | | Relax the order of floating point arithmetic operations, except for code blocks fenced by #pragma clang fp reassoc(off). To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/floating_point_ops |
--daz | | For double data types only, disable subnormal support in double-precision floating-point computations. |
--rounding=[ieee | faithful] | | For double data types only, control the rounding scheme for double-precision adders, multipliers, and dividers. If you do not specify this option, adders and multipliers use IEEE-754 round to nearest, ties to even (RNE) rounding (0.5 ULP) and dividers use faithful rounding (1 ULP). The --rounding option can take one of the following values: ieee or faithful. |
--clock clock target | 240 MHz | Optimize the RTL for the specified clock frequency or period. The clock target value must include a unit. For example: i++ -march="Arria 10" test.cpp --clock 100MHz or i++ -march="Arria 10" test.cpp --clock 10ns |
Command Options Affecting Linking
Option | Default Value | Description |
---|---|---|
-ghdl[=<depth>] | | Enable full debug visibility and logging of HDL signals in simulation. Use the optional <depth> attribute to specify how many levels of hierarchy are logged. If you do not specify a value for the <depth> attribute, all signals are logged. Use -ghdl=1 to log only the top-level signals. |
-L dir | | (Linux only) Add directory <dir> to the list of directories to be searched for library files specified with the -l option. |
-l library | | (Linux only) Use the library name <library> when linking. |
--x86-only | | Create only the testbench executable (<result>.out/<result>.exe). |
--fpga-only | | Create only the <result>.prj directory and its contents. |
13.2. Intel HLS Compiler Pro Edition Header Files
Coding your component to be compiled by the Intel® HLS Compiler requires you to include the hls.h header file. Other header files provided with the Intel® HLS Compiler provide FPGA-optimized implementations of certain C and C++ functions.
HLS Header File | Description |
---|---|
HLS/hls.h | Required for component identification and component parameter interfaces. |
HLS/math.h | Includes FPGA-specific definitions for the math functions from the math.h for your operating system. |
HLS/extendedmath.h | Includes additional FPGA-specific definitions of math functions not in math.h. |
HLS/ac_int.h | Provides FPGA-optimized arbitrary width integer support. |
HLS/ac_fixed.h | Provides FPGA-optimized arbitrary precision fixed point support. |
HLS/ac_fixed_math.h | Provides FPGA-optimized arbitrary precision fixed point math functions. |
HLS/ac_complex.h | Provides FPGA-optimized complex number support. |
HLS/hls_float.h | Provides FPGA-optimized arbitrary-precision IEEE-754 compliant floating-point number support. |
HLS/hls_float_math.h | Provides FPGA-optimized floating-point math functions. |
HLS/stdio.h | Provides printf support for components so that printf statements work in x86 emulations, but are disabled in the component when compiling to an FPGA architecture. |
<iostream> | To use cout and cerr in your component, guard the statements with the HLS_SYNTHESIS macro. |
hls.h Header File
- Syntax
- #include "HLS/hls.h"
- Description
- Required for component identification and component parameter interfaces.
math.h Header File
- Syntax
- #include "HLS/math.h"
- Description
- Includes FPGA-specific definitions for the math
functions from the math.h for your
operating system.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/single_vs_double_precision_math.
extendedmath.h Header File
- Syntax
- #include "HLS/extendedmath.h"
- Description
- Includes additional FPGA-specific definitions of math
functions not in math.h.
To learn more, review the following design: <quartus_installdir>/hls/examples/QRD.
ac_int.h Header File
- Syntax
- #include "HLS/ac_int.h"
- Description
-
Intel® HLS Compiler version of the ac_int header file.
Provides FPGA-optimized arbitrary width integer support.
To learn more, review the following tutorials:
- <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_int_basic_ops
- <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_int_overflow
- <quartus_installdir>/hls/examples/tutorials/best_practices/struct_interfaces
ac_fixed.h Header File
- Syntax
- #include "HLS/ac_fixed.h"
- Description
-
Intel® HLS Compiler version of the ac_fixed header file.
Provides FPGA-optimized arbitrary precision fixed point support.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_fixed_constructor.
ac_fixed_math.h Header File
- Syntax
- #include "HLS/ac_fixed_math.h"
- Description
-
Intel® HLS Compiler version of the ac_fixed_math header file.
Provides FPGA-optimized arbitrary precision fixed point math functions.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_fixed_math_library.
ac_complex.h Header File
- Syntax
- #include "HLS/ac_complex.h"
- Description
-
Intel® HLS Compiler version of the ac_complex header file.
Provides FPGA-optimized complex math functions.
hls_float.h Header File
- Syntax
- #include "HLS/hls_float.h"
- Description
- Header file to provide FPGA-optimized arbitrary-precision IEEE 754 compliant floating-point number support.
To learn more, review the following tutorials:
- <quartus_installdir>/hls/examples/tutorials/hls_float/1_reduced_doubl
- <quartus_installdir>/hls/examples/tutorials/hls_float/2_explicit_arithmetic
- <quartus_installdir>/hls/examples/tutorials/hls_float/3_conversions
hls_float_math.h Header File
- Syntax
- #include "HLS/hls_float_math.h"
- Description
- Header file to provide math functions for hls_float data types.
To learn more, review the following tutorials:
- <quartus_installdir>/hls/examples/tutorials/hls_float/1_reduced_doubl
- <quartus_installdir>/hls/examples/tutorials/hls_float/2_explicit_arithmetic
- <quartus_installdir>/hls/examples/tutorials/hls_float/3_conversions
stdio.h Header File
- Syntax
- #include "HLS/stdio.h"
- Description
- Provides printf support for components so that printf statements work in x86 emulations, but are disabled in the component when compiling to an FPGA architecture.
Standard C++ <iostream> Header File
- Syntax
- #include <iostream>
- Description
- To use the C++ standard output streams (cout and cerr) provided by the standard <iostream> header,
you must guard any standard output statements with the HLS_SYNTHESIS macro.
This macro ensures that statements in a component work in x86 emulations but are disabled in the component when compiling to an FPGA architecture.
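The guard described above can be sketched as follows. The helper name scale_and_log is hypothetical and stands in for any function that runs as part of a component. The std::cout statement is compiled only when the HLS_SYNTHESIS macro is undefined.

```cpp
#include <iostream>

// Hypothetical helper that is called from component code.
int scale_and_log(int value) {
  int result = value * 2;
#ifndef HLS_SYNTHESIS
  // Compiled only when HLS_SYNTHESIS is undefined (g++ or cl builds, x86-64
  // emulation, and testbench code). Skipped when component code is compiled
  // for an FPGA architecture.
  std::cout << "scale_and_log(" << value << ") = " << result << std::endl;
#endif
  return result;
}
```

The computation is kept outside the guard so that the emulated and synthesized versions of the function return the same value.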
13.3. Intel HLS Compiler Pro Edition Compiler-Defined Preprocessor Macros
Tool Invocation | __INTELFPGA_COMPILER__ |
---|---|
g++ or cl | Undefined |
i++ -march=x86-64 | 2040 |
i++ -march="<FPGA_family_or_part_number>" | 2040 |
Tool Invocation | HLS_SYNTHESIS in Testbench Code | HLS_SYNTHESIS in HLS Component Code |
---|---|---|
g++ or cl | Undefined | Undefined |
i++ -march=x86-64 | Undefined | Undefined |
i++ -march="<FPGA_family_or_part_number>" | Undefined | Defined |
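As a small sketch of how testbench code might branch on these macros, the following helpers report whether the code was built with i++. The function names intel_hls_compiler_version and built_with_ixx are hypothetical. The sketch relies only on the __INTELFPGA_COMPILER__ behavior shown in the first table.

```cpp
// Hypothetical testbench helper: returns the value of __INTELFPGA_COMPILER__
// (for example, 2040), or 0 when the macro is undefined (g++ or cl).
int intel_hls_compiler_version() {
#ifdef __INTELFPGA_COMPILER__
  return __INTELFPGA_COMPILER__;
#else
  return 0;
#endif
}

// Hypothetical testbench helper: true only for builds made with i++.
bool built_with_ixx() {
  return intel_hls_compiler_version() != 0;
}
```

Because __INTELFPGA_COMPILER__ has the same value for emulation and FPGA targets, use HLS_SYNTHESIS rather than this macro to tell the two component builds apart.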
13.4. Intel HLS Compiler Pro Edition Keywords
Feature | Description |
---|---|
component | Indicates that a function is a component. Example: component void foo() |
13.5. Intel HLS Compiler Pro Edition Simulation API (Testbench Only)
Function | Description |
---|---|
ihc_hls_enqueue | This function enqueues one invocation of an HLS component. |
ihc_hls_enqueue_noret | This function enqueues one invocation of an HLS component. This function should be used when the return type of the HLS component is void. |
ihc_hls_component_run_all | This function pushes all enqueued invocations of a component into the component in the HDL simulator as quickly as the component can accept new invocations. |
ihc_hls_sim_reset | This function sends a reset signal to the component during automated simulation. |
ihc_hls_set_component_wait_cycle | This function tells the simulation process to continue running for a specified number of additional cycles (beyond the default wait period of 100 cycles) after the done signal for the specified component is observed. |
ihc_hls_enqueue Function
- Syntax
- ihc_hls_enqueue(void* retptr, void* funcptr, /*function arguments*/)
- Description
- This function enqueues one invocation of an HLS component. The return value is stored in the first argument, which should be a pointer to the return type. The component is not run until ihc_hls_component_run_all() is invoked.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/usability/enqueue_call.
ihc_hls_enqueue_noret Function
- Syntax
- ihc_hls_enqueue_noret(void* funcptr, /*function arguments*/)
- Description
- This function enqueues one invocation of an HLS component. This function should be used when the return type of the HLS component is void. The component is not run until ihc_hls_component_run_all() is invoked.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/usability/enqueue_call.
ihc_hls_component_run_all Function
- Syntax
- ihc_hls_component_run_all (void* funcptr)
- Description
- This function accepts a pointer to
the HLS component function. When run, all enqueued
invocations of the component will be pushed into the
component in the HDL simulator as quickly as the
component can accept new invocations.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/usability/enqueue_call.
ihc_hls_sim_reset Function
- Syntax
- int ihc_hls_sim_reset(void)
- Description
- This function sends a reset signal
to the component during automated simulation. It
returns 1 if the reset was exercised or 0 otherwise.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/static_var_init.
ihc_hls_set_component_wait_cycle Function
- Syntax
- ihc_hls_set_component_wait_cycle(<component function name>, <# of wait cycles>)
- Description
- This function tells the simulation process to
continue running for a specified number of
additional cycles (beyond the default wait period of
100 cycles) after the done signal for the specified
component is observed. This delay can enable task
functions with a higher latency than the component
function to successfully return their output during
simulation.
Use this function when you simulate a design that uses a system of tasks where the completion of a task function is not synchronized with an ihc::collect call.
By default, the simulation process simulates an additional 100 cycles after a component asserts the done signal to ensure all operations have propagated back to the testbench. This function tells the simulation process for the specified component to continue running for the specified number of cycles in addition to the default wait period of 100 cycles.
Simulation API Code Example
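#include "HLS/hls.h"
component int foo(int val) {
  // function definition
}
component void bar(int val) {
  // function definition
}
int main() {
  // ...
  int input = 0;
  int res[5];
  // Each enqueued invocation of foo stores its return value in its own element of res.
  ihc_hls_enqueue(&res[0], &foo, input);
  ihc_hls_enqueue_noret(&bar, input);
  input = 1;
  ihc_hls_enqueue(&res[1], &foo, input);
  ihc_hls_enqueue_noret(&bar, input);
  // The enqueued invocations run only when ihc_hls_component_run_all() is called.
  ihc_hls_component_run_all(&foo);
  ihc_hls_component_run_all(&bar);
}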
13.6. Intel HLS Compiler Pro Edition Component Memory Attributes
Use the component memory attributes to control the on-chip component memory architecture of your component.
Memory Attribute | Description |
---|---|
hls_force_pow2_depth | Specifies that the memory implementing the variable or array has power-of-2 depth. |
hls_register | Forces a variable or array to be carried through the pipeline in registers. A register variable can be implemented either exclusively in flip-flops (FFs) or in a mix of FFs and RAM-based FIFOs. |
hls_memory | Forces a variable or array to be implemented as embedded memory. |
hls_memory_impl | Forces a variable or array to be implemented as embedded memory of a specified type. |
hls_singlepump | Specifies that the memory implementing the variable or array must be clocked at the same rate as the component accessing the memory. |
hls_doublepump | Specifies that the memory implementing the variable or array must be clocked at twice the rate as the component accessing the memory. |
hls_numbanks | Specifies that the memory implementing the variable or array must have a defined number of memory banks. |
hls_bankwidth | Specifies that the memory implementing the variable or array must have memory banks of a defined width. |
hls_bankbits | Forces the memory system to split into a defined number of memory banks and defines the bits used to select a memory bank. |
hls_simple_dual_port_memory | Specifies that the memory implementing the variable or array should have no port that services both reads and writes. |
hls_merge (depthwise) | Allows merging two or more local variables to be implemented in component memory as a single merged memory system in a depth-wise manner. |
hls_merge (widthwise) | Allows merging two or more local variables to be implemented in component memory as a single merged memory system in a width-wise manner. |
hls_init_on_reset | Forces the static variables inside the component to be initialized when the component reset signal is asserted. |
hls_init_on_powerup | Sets the component memory implementing the static variable to initialize on power-up when the FPGA is programmed. |
hls_max_concurrency | Deprecated: This attribute is deprecated and will be removed in a future release. Use the hls_private_copies memory attribute instead. Specifies that the memory implementing the variable or array has a defined number of private copies to allow concurrent iterations of a loop at any given time. |
hls_max_replicates | Specifies that the memory implementing the variable or array has no more than the specified number of replicates to enable simultaneous reads from the datapath. |
hls_private_copies | Specifies that the memory implementing the variable or array has a defined number of private copies to allow concurrent iterations of a loop at any given time. |
hls_force_pow2_depth Memory Attribute
- Syntax
- hls_force_pow2_depth(N)
- Constraints
- N can be only 0 or 1.
- Default Value
- 1
- Description
- Specifies that the memory implementing the variable or array has a power-of-2 depth. This option is enabled if N is 1, and disabled if N is 0.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/non_power_of_two_memory.
hls_register Memory Attribute
- Syntax
- hls_register
- Constraints
- N/A
- Default Value
- Based on the memory access pattern inferred by the compiler.
- Description
- Forces a variable or array to be
implemented as registers.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/swap_vs_copy.
hls_memory Memory Attribute
- Syntax
- hls_memory
- Constraints
- N/A
- Default Value
- Based on the memory access pattern inferred by the compiler.
- Description
- Forces a variable or array to be
implemented as embedded memory.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_implementation.
hls_memory_impl Memory Attribute
- Syntax
- hls_memory_impl("type")
- Constraints
- N/A
- Default Value
- Based on the memory size and memory access pattern inferred by the compiler.
- Description
-
Forces a variable or array to be implemented as embedded memory of the specified type.
The type parameter can be one of the following values:
- BLOCK_RAM
- Implement the variable or array as memory blocks, such as M20K memory blocks.
- MLAB
- Implement the variable or array as memory logic array blocks (MLABs).
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_implementation.
hls_singlepump Memory Attribute
- Syntax
- hls_singlepump
- Constraints
- N/A
- Default Value
- Based on the memory access pattern inferred by the compiler.
- Description
- Specifies that the memory
implementing the variable or array must be clocked
at the same rate as the component accessing the
memory.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/QRD.
hls_doublepump Memory Attribute
- Syntax
- hls_doublepump
- Constraints
- N/A
- Default Value
- Based on the memory access pattern inferred by the compiler.
- Description
- Specifies that the memory
implementing the variable or array must be clocked
at twice the rate of the component accessing the
memory.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_bank_configuration.
hls_numbanks Memory Attribute
- Syntax
- hls_numbanks(N)
- Constraints
- This attribute is subject to constraints outlined in Constraints on Attributes for Memory Banks.
- Default Value
- Based on the memory access pattern inferred by the compiler.
- Description
- Specifies that the memory
implementing the variable or array must have
N banks,
where N is a
power-of-two constant number.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_geometry.
hls_bankwidth Memory Attribute
- Syntax
- hls_bankwidth(N)
- Constraints
- This attribute is subject to constraints outlined in Constraints on Attributes for Memory Banks.
- Default Value
- Based on the memory access pattern inferred by the compiler.
- Description
- Specifies that the memory
implementing the variable or array must have banks
that are N
bytes wide, where N is a power-of-two constant
number.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_geometry.
hls_bankbits Memory Attribute
- Syntax
- hls_bankbits(b0, b1, ..., bn)
- Constraints
- This attribute is subject to constraints outlined in Constraints on Attributes for Memory Banks.
- Default Value
- Based on the memory access pattern inferred by the compiler.
- Description
- Forces the memory system to split into 2^(n+1) banks, with {b0, b1, ..., bn} forming the bank-select bits.
Important: b0, b1, ..., bn must be consecutive, positive integers. You can specify the consecutive, positive integers in ascending or descending order.
If you do not specify the hls_bankwidth(N) attribute along with this attribute, then b0, b1, ..., bn are mapped to array index bits 0 to n-1 in the memory bank implementation.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_geometry.
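The bank-count arithmetic for hls_bankbits can be worked through with a short sketch. The helpers bank_count and bank_index are hypothetical and are only an illustrative model of selecting a bank from consecutive address bits. They are not the compiler's implementation, and the exact address that the bank-select bits apply to is not modeled here.

```cpp
// Number of banks selected by (n + 1) bank-select bits: 2^(n+1).
constexpr unsigned bank_count(unsigned num_bank_bits) {
  return 1u << num_bank_bits;
}

// Illustrative model only: read the bank index from the consecutive address
// bits [lowest_bit, lowest_bit + num_bank_bits).
constexpr unsigned bank_index(unsigned address, unsigned lowest_bit,
                              unsigned num_bank_bits) {
  return (address >> lowest_bit) & (bank_count(num_bank_bits) - 1u);
}

// hls_bankbits(3, 2) names two bank-select bits, which gives four banks.
static_assert(bank_count(2) == 4, "two bank-select bits give four banks");
// Address 12 (binary 1100) has bits 3 and 2 set, so this model picks bank 3.
static_assert(bank_index(12u, 2, 2) == 3, "bits 3..2 of 12 are binary 11");
```

In this model, addresses that differ only in the bank-select bits fall into different banks, which is what allows them to be accessed at the same time.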
hls_simple_dual_port_memory Memory Attribute
- Syntax
- hls_simple_dual_port_memory
- Constraints
- N/A
- Default Value
- N/A
- Description
- Specifies that the memory
implementing the variable or array should have no
port that services both reads and writes.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_bank_configuration.
hls_merge (depthwise) Memory Attribute
- Syntax
- hls_merge("mem_name", "depth")
- Constraints
- N/A
- Default Value
- N/A
- Description
- Allows merging two or more local
variables to be implemented in component memory as a
single merged memory system in a depth-wise
manner.
All variables with the same <mem_name> label specified in their hls_merge attribute are merged into the same memory system.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_merging.
hls_merge (widthwise) Memory Attribute
- Syntax
- hls_merge("mem_name", "width")
- Constraints
- N/A
- Default Value
- N/A
- Description
- Allows merging two or more local
variables to be implemented in component memory as a
single merged memory system in a width-wise
manner.
All variables with the same <mem_name> label specified in their hls_merge attribute are merged into the same memory system.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_merging.
hls_init_on_reset Memory Attribute
- Syntax
- hls_init_on_reset
- Constraints
- N/A
- Default Value
- Default behavior for static variables.
- Description
- Forces the static variable inside
the component to be initialized when the component
reset signal
is asserted. This requires an additional write port
to the component memory implemented and can increase
the power-up latency when the component is reset.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/static_var_init.
hls_init_on_powerup Memory Attribute
- Syntax
- hls_init_on_powerup
- Constraints
- N/A
- Default Value
- N/A
- Description
- Sets the component memory
implementing the static variable to initialize on
power-up when the FPGA is programmed. When the
component is reset, the component memory is not
reset back to the initialized value of the static.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/static_var_init.
hls_max_concurrency Memory Attribute
- Syntax
- hls_max_concurrency(N)
- Constraints
- N/A
- Default Value
- N/A
- Description
- Specifies that the memory
implementing the variable or array has N private copies to
allow N
concurrent iterations of a loop at any given time.
Apply this attribute only when the scope of a variable (through its declaration or access pattern) is limited to a loop. If the loop has the max_concurrency pragma applied to it, the number of private copies created is the lesser of the hls_max_concurrency memory attribute value and the max_concurrency pragma value.
hls_max_replicates Memory Attribute
- Syntax
- hls_max_replicates(N)
- Constraints
- N/A
- Default Value
- N/A
- Description
- Specifies that the memory implementing the variable or array has no more than N replicates to enable simultaneous reads from the datapath.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/component_memories/memory_bank_configuration.
hls_private_copies Memory Attribute
- Syntax
- hls_private_copies(N)
- Constraints
- N/A
- Default Value
- N/A
- Description
- Specifies that the memory
implementing the variable or array has N private copies to
allow N
concurrent iterations of a loop at any given time.
Apply this attribute only when the scope of a variable (through its declaration or access pattern) is limited to a loop. If the loop has the max_concurrency pragma applied to it, the number of private copies created is the lesser of the hls_private_copies memory attribute value and the max_concurrency pragma value.
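The interaction between the attribute and the pragma can be sketched as follows. The component name sum_of_squares_rows and the array scratch are hypothetical, and placing the attribute directly before the declaration is an assumption about usage. The #else branch contains stand-ins that are not part of the Intel® HLS Compiler, and the guard around the pragma only avoids unknown-pragma warnings; both exist so that the sketch also builds with an ordinary C++ compiler.

```cpp
#ifdef __INTELFPGA_COMPILER__
#include "HLS/hls.h"
#else
// Stand-ins (not part of the Intel HLS Compiler) for ordinary C++ compilers.
#define component
#define hls_private_copies(N)
#endif

// Hypothetical component: sums (r + c)^2 over 8 columns for each row.
component int sum_of_squares_rows(int rows) {
  int total = 0;
#ifdef __INTELFPGA_COMPILER__
#pragma max_concurrency 2
#endif
  for (int r = 0; r < rows; r++) {
    // The scope of "scratch" is limited to this loop, so the attribute applies.
    // The attribute asks for 4 private copies and the pragma limits the loop
    // to 2 concurrent iterations, so by the rule described above the number
    // of private copies created is the lesser value, 2.
    hls_private_copies(4) int scratch[8];
    for (int c = 0; c < 8; c++) scratch[c] = (r + c) * (r + c);
    for (int c = 0; c < 8; c++) total += scratch[c];
  }
  return total;
}
```

Neither the attribute nor the pragma changes the value that the function returns. They affect only how much memory is replicated and how many iterations can be in flight.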
13.7. Intel HLS Compiler Pro Edition Loop Pragmas
Use the Intel® HLS Compiler loop pragmas to control how the compiler pipelines the loops in your component.
Pragma | Description |
---|---|
disable_loop_pipelining | Prevents the compiler from pipelining a loop. |
ii | Forces a loop to have a loop initiation interval (II) of a specified value. |
ivdep | Ignores memory dependencies between iterations of this loop. |
loop_coalesce | Tries to fuse all loops nested within this loop into a single loop. |
loop_fuse | Directs the compiler to try to fuse pairs of adjacent loops. |
max_concurrency | Limits the number of iterations of a loop that can simultaneously execute at any time. |
max_interleaving | Controls whether iterations of a pipelined inner loop in a loop nest from one invocation of the inner loop can be interleaved in the component data pipeline with iterations from other invocations of the inner loop. |
nofusion | Prevents the annotated loop from being fused with adjacent loops. |
speculated_iterations | Specifies the number of clock cycles that a loop exit condition can take to compute. |
unroll | Unrolls the loop completely or by a number of times. |
disable_loop_pipelining Loop Pragma
- Syntax
- #pragma disable_loop_pipelining
- Description
- Tells the compiler to not pipeline this
loop.
Disable loop pipelining for a loop when the loop-carried dependencies cause the loop iterations to effectively execute sequentially. With loop pipelining disabled, the Intel® HLS Compiler can generate a simpler datapath and reduce the FPGA area utilization of your component.
Example:
#pragma disable_loop_pipelining
for (int i = 1; i < N; i++) {
  int j = a[i-1];
  // Memory dependency induces a high-latency loop feedback path
  a[i] = foo(j);
}
ii Loop Pragma
- Syntax
- #pragma ii N
- Description
- Forces the loop to which you apply this pragma
to have a loop initiation interval (II) of <N>, where <N> is a positive integer
value.
Forcing a loop II value can have an adverse effect on the fMAX of your component because using this pragma to get a lower loop II combines pipeline stages together and creates logic with a long propagation delay.
Using this pragma with a larger loop II inserts more pipeline stages and can give you a better component fMAX value.
Example:
#pragma ii 2
for (int i = 0; i < 8; i++) {
  // Loop body
}
ivdep Loop Pragma
- Syntax
- #pragma ivdep safelen(N) array(array_name)
- Description
- Tells the compiler to ignore memory
dependencies between iterations of this loop.
It can accept an optional argument that specifies the name of the array. If array is not specified, all component memory dependencies are ignored. If there are loop-carried dependencies, your generated RTL produces incorrect results.
The safelen parameter specifies the dependency distance. The dependency distance is the number of iterations between successive load/stores that depend on each other. It is safe to omit safelen only when the dependency distance is infinite (that is, there are no real dependencies).
Example:
#pragma ivdep safelen(2)
for (int i = 0; i < 8; i++) {
  // Loop body
}
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/loop_memory_dependency.
loop_coalesce Loop Pragma
- Syntax
- #pragma loop_coalesce N
- Description
- Tells the compiler to try to fuse all loops
nested within this loop into a single loop. This pragma accepts an
optional value N which
indicates the number of levels of loops to coalesce together.
Example:
#pragma loop_coalesce 2
for (int i = 0; i < 8; i++) {
  for (int j = 0; j < 8; j++) {
    // Loop body
  }
}
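As a conceptual model of what coalescing two loop levels means, the following sketch compares the nested form with a hand-written single loop that visits the same iterations in the same order. The function names are hypothetical, and the sketch describes only functional equivalence; it makes no claim about the hardware that the compiler generates.

```cpp
// Nested form, as the loops appear in source code.
int sum_nested(const int (&grid)[8][8]) {
  int total = 0;
  for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
      total += grid[i][j];
    }
  }
  return total;
}

// Hand-written single loop over the same 64 iterations. The original
// indices are recovered from the single counter k.
int sum_coalesced_by_hand(const int (&grid)[8][8]) {
  int total = 0;
  for (int k = 0; k < 8 * 8; k++) {
    int i = k / 8;  // outer-loop index
    int j = k % 8;  // inner-loop index
    total += grid[i][j];
  }
  return total;
}
```

With the pragma, you keep the nested form in your source code and leave this kind of transformation to the compiler.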
loop_fuse Block-Scope Loop Pragma
- Syntax
- #pragma loop_fuse [depth(N)] [independent]
- Description
- Apply this pragma to a block of code to
indicate to the compiler that adjacent loops in the code block
should be fused when safe, overriding the compiler profitability
analysis of the fusion.
The depth(N) clause sets the number of nesting depths the compiler should consider when fusing adjacent loops. Specifying depth(1) is equivalent to indicating that only adjacent top-level loops should be considered for fusing.
The independent clause overrides the safety checks. If you specify the independent option, you are guaranteeing to the compiler that fusing pairs of loops affected by the loop_fuse pragma is safe. If it is not safe, you might get functional errors in your component.
For details of the safety checks, see the Fusion Criteria section of Loop Fusion.
Example:
#pragma loop_fuse
{
  for (int j = 0; j < N; ++j) {
    data[j] += Q;
  }
  for (int i = 0; i < N; ++i) {
    output[i] = Q * data[i];
  }
}
max_concurrency Loop Pragma
- Syntax
- #pragma max_concurrency N
- Description
- This pragma limits the number of iterations of
a loop that can simultaneously execute at any time.
This pragma is useful mainly when private copies of a component memory are created to improve the throughput of the loop. The creation of private copies is reported in the details pane for the loop in the Loop Analysis pane and in the Bank view of the Function Memory Viewer of the high level design report (report.html).
Private copies can be created only when the scope of a component memory (through its declaration or access pattern) is limited to this loop. You can add this pragma to reduce the area that the loop consumes at the cost of some throughput.
Example:
// Without this pragma, multiple private copies
// of the array "arr" might be created
#pragma max_concurrency 1
for (int i = 0; i < 8; i++) {
  int arr[1024];
  // Loop body
}
max_interleaving Loop Pragma
- Syntax
- #pragma max_interleaving <option>
- Description
- This pragma controls whether iterations of a
pipelined inner loop in a loop nest from one invocation of the
inner loop can be interleaved in the component data pipeline with
iterations from other invocations of the inner loop.
By default, the Intel® HLS Compiler tries to interleave a number of simultaneous invocations of the inner loop equal to the loop initiation interval (II) of the inner loop. For example, an inner loop with an II of 2 can have iterations from two invocations in the pipeline at a time.
In cases where the interleaving of loop iterations from different loop invocations does not yield a performance benefit, limiting or restricting the amount of interleaving can result in reduced FPGA area utilization.
Supported values for <option>:
- 1
The compiler restricts the annotated (inner) loop to be invoked only once per outer loop iteration. That is, all iterations of the inner loop travel the pipeline before the next invocation of the inner loop can occur.
- 0
Use the default interleaving behavior.
Example:
// Loop j is pipelined with ii=1
for (int j = 0; j < M; j++) {
  int a[N];
  // Loop i is pipelined with ii=2
  #pragma max_interleaving 1
  for (int i = 1; i < N; i++) {
    a[i] = foo(i);
  }
  …
}
nofusion Loop Pragma
- Syntax
- #pragma nofusion
- Description
- This pragma directs the compiler to not fuse the annotated loop with any adjacent loops.
Example:
#pragma nofusion
L1: for (int j = 0; j < N; ++j) {
  data[j] += Q;
}
L2: for (int i = 0; i < N; ++i) {
  output[i] = Q * data[i];
}
speculated_iterations Loop Pragma
- Syntax
- #pragma speculated_iterations N
- Description
- This pragma specifies the number of loop
iterations to wait before considering a loop exit condition. That
is, you estimate that a loop takes at least N loop iterations before the exit
condition is met.
If you specify a value that is too low, then the loop II increases to accommodate the iterations required to determine whether the loop exit condition is met.
Example:
component int loop_speculate (int N) {
  int m = 0;
  // The exit path has 2 multiplies and
  // compare is most critical in loop feedback path
  #pragma speculated_iterations 2
  while (m*m*m < N) {
    m += 1;
  }
  return m;
}
unroll Loop Pragma
- Syntax
- #pragma unroll N
- Description
- This pragma unrolls the loop completely or by <N> times, where <N> is optional and is a positive integer value.
Important: Unrolling nested loops with large bounds might generate a large number of instructions that could result in very long compile times for your component.
Example:
#pragma unroll 8
for (int i = 0; i < 8; i++) {
  // Loop body
}
To learn more, review the tutorial: <quartus_installdir>/hls/examples/best_practices/resource_sharing_filter.
13.8. Intel HLS Compiler Pro Edition Scope Pragmas
Use the Intel® HLS Compiler scope pragmas to influence the rounding of floating-point operations and the ordering of arithmetic operations in your component at a finer grain than the i++ command options allow.
Pragma | Description |
---|---|
fp contract | Controls the removal of intermediate rounding and conversion when possible within the code block that this pragma is applied to. |
fp reassoc | Controls the relaxing of the order of floating point arithmetic operations within the code block that this pragma is applied to. |
fp contract Scoped Pragma
- Syntax
- #pragma clang fp contract(state)
- Description
- This pragma controls whether the compiler can contract floating-point multiply and add or subtract operations into a single fused multiply-add (FMA), and whether the compiler can skip intermediate rounding and conversions.
If multiple occurrences of this pragma affect the same scope, the pragma with the narrowest scope takes precedence.
The state parameter can be one of the following values:
- off
Turns off any permissions to fuse instructions into FMAs.
The effect of the -ffp-contract=fast i++ command flag is suppressed for instructions within the scope of the pragma.
- fast
Allows the fusing of multiply and add instructions into an FMA, but might violate the language standard.
For instructions within the scope of this pragma, the same optimizations as the -ffp-contract=fast i++ command flag are enabled.
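The following sketch shows one way the two states might be applied to individual functions. It assumes the usual Clang placement rule, in which the pragma appears at the start of the compound statement that it governs. The function names are hypothetical. Compilers that do not recognize the pragma ignore it, so the arithmetic can still be checked in software.

```cpp
// Hypothetical example functions: not part of the Intel HLS Compiler headers.

// The multiply and add in this block may be fused into a single FMA.
float mul_add_fused(float a, float b, float c) {
  #pragma clang fp contract(fast)
  return a * b + c;
}

// The multiply result is rounded before the add, even if
// -ffp-contract=fast is given on the command line.
float mul_add_unfused(float a, float b, float c) {
  #pragma clang fp contract(off)
  return a * b + c;
}
```

For inputs whose product is exactly representable, both functions return the same value; the two versions can differ only when the intermediate rounding of the product matters.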
fp reassoc Scoped Pragma
- Syntax
- #pragma clang fp reassoc(state)
- Description
- This pragma controls whether the compiler can relax the order of floating-point operations requested by the source code. With some reordering, the compiler can optimize the hardware structure, which can improve the performance of your component.
If multiple occurrences of this pragma affect the same scope, the pragma with the narrowest scope takes precedence.
The state parameter can be one of the following values:
- on
Enables the effect of the -ffp-reassoc i++ command flag for instructions within the scope of the pragma.
- off
The effect of the -ffp-reassoc i++ command flag is suppressed for instructions within the scope of the pragma.
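Reordering is something you opt in to because floating-point addition is not associative. The following sketch uses plain C++ with no HLS-specific constructs to show two summation orders that can produce different results for the same inputs, assuming IEEE 754 single-precision evaluation. The function names are hypothetical.

```cpp
// Hypothetical example functions used only to illustrate evaluation order.

// Evaluates (a + b) + c, the order written in the source.
float sum_left(float a, float b, float c) {
  return (a + b) + c;
}

// Evaluates a + (b + c), one order that reassociation could choose.
float sum_right(float a, float b, float c) {
  return a + (b + c);
}
```

With a = 1e8f, b = -1e8f, and c = 1.0f, sum_left returns 1.0f, while sum_right returns 0.0f because b + c rounds back to -1e8f in single precision. Enable reassociation only in scopes where this kind of difference is acceptable.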
13.9. Intel HLS Compiler Pro Edition Component Attributes
Attribute | Description |
---|---|
hls_component_ii | Force the component to which you apply this attribute to have a specified component initiation interval (II). |
hls_disable_component_pipelining | Prevents the creation of the pipelined component datapath. Multiple invocations of this component now occur sequentially and not simultaneously. |
hls_max_concurrency | Request more copies of the component memory so that the component can run multiple invocations in parallel. |
hls_scheduler_target_fmax_mhz | Specify the target clock frequency of your component. |
hls_use_stall_enable_clusters | Group related operations into stall-enabled clusters to try to improve latency and area usage while possibly sacrificing throughput, when compared to the default stall-free clustering implementation. |
hls_component_ii Component Attribute
- Syntax
- hls_component_ii(<N>)
- Description
- Forces the component to which you
apply this attribute to have a component initiation
interval (II) of <N>, where <N> is a
positive integer value.
This can have an adverse effect on the fMAX of your component because using this attribute to get a lower II combines pipeline stages together and creates logic with a long propagation delay.
Using this attribute with a larger II inserts more pipeline stages and can give you a better component fMAX value.
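As a minimal sketch, the attribute might be applied as follows. The placement before the component keyword follows the hls_max_concurrency example in this manual. The fallback macro definitions are hypothetical shims that exist only so that the sketch also builds with a plain C++ compiler when HLS/hls.h is not available; the function name and body are likewise illustrative.

```cpp
#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"
#define HAVE_HLS_HEADER 1
#endif
#endif

#ifndef HAVE_HLS_HEADER
// Hypothetical fallbacks: make the HLS keywords expand to nothing
#define component
#define hls_component_ii(N)
#endif

// Request a component II of 2: a new invocation can start every 2 cycles.
hls_component_ii(2)
component int scale_and_offset(int x, int gain, int offset) {
  return x * gain + offset;
}
```

Compare the fMAX and II values reported in the high level design report with and without the attribute before you commit to a forced II.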
hls_disable_component_pipelining Component Attribute
- Syntax
- hls_disable_component_pipelining
- Description
- Tells the compiler to not create a
pipelined datapath for the component. An unpipelined
component datapath can save FPGA area utilization in
some cases.
Use this attribute when a pipelined datapath does not improve your component throughput or when the component is not invoked repeatedly.
- Example
-
#include "HLS/hls.h"

hls_disable_component_pipelining
component void baz ( /* arguments */ ) {
  // component code
}
hls_max_concurrency Component Attribute
- Syntax
- hls_max_concurrency(<N>)
- Description
- In some cases, the concurrency of
a component is limited to 1. This limit
occurs when the generated hardware cannot be shared
across component invocations. For example, when
using component memories for a non-static variable.
You can use this attribute to request more copies of the component memory so that the component can run multiple invocations in parallel.
This attribute can accept any non-negative whole number, including 0.
- Value greater than 0
- A value greater than 0 indicates how many copies of the component memory to instantiate as well as how many component invocations can be in flight at once.
- Value equal to 0
- Setting hls_max_concurrency to a value of 0 is useful in cases when there is no component memory but the component still has a poor dynamic loop initiation interval (II) even if you believe your component II should be 1. You can review the II for loops in your component in the high level design report.
To learn more, review the design example: <quartus_installdir>/hls/examples/inter_decim_filter.
- Example
-
hls_max_concurrency(2)
component void foo(ihc::stream_in<int> &data_in,
                   ihc::stream_out<int> &data_out) {
  int arr[N];
  for (int i = 0; i < N; i++) {
    arr[i] = data_in.read();
  }
  // Operate on the data and modify in place
  for (int i = 0; i < N; i++) {
    data_out.write(arr[i]);
  }
}
hls_scheduler_target_fmax_mhz Component Attribute
- Syntax
- hls_scheduler_target_fmax_mhz(<N>)
- Description
- Apply the hls_scheduler_target_fmax_mhz
component attribute to have the compiler target a
specific fMAX value.
Specify the target fMAX
value in MHz.
The component is not guaranteed to close timing at the specified frequency, and any tasks in a system of tasks use the same clock regardless of having different scheduling targets.
hls_use_stall_enable_clusters Component Attribute
- Syntax
- hls_use_stall_enable_clusters
- Description
- Apply the hls_use_stall_enable_clusters component attribute to reduce the area of your component while possibly decreasing your component fMAX and throughput.
The Intel® HLS Compiler typically groups related operations into clusters. In many cases, the clusters are stall-free clusters. A stall-free cluster executes the operations without any stalls and contains a FIFO at the end of the cluster that holds the results if the cluster is stalled. This FIFO adds area and latency to the component, but might allow a higher fMAX and increased throughput.
If you prefer lower FPGA area usage and lower latency over higher throughput, use the hls_use_stall_enable_clusters component attribute to bias the compiler to produce stall-enabled clusters. Stall-enabled clusters lack the FIFO, which reduces area and latency, but pass stall signals to the contained operations.
Passing stall signals might reduce fMAX.
Not all operations support stall, and these operations cannot be contained in a stall-enabled cluster. The compiler generates a warning if some operations cannot be placed into a stall-enabled cluster.
The compiler automatically uses stall-enabled clusters for HLS components if it can determine that stall-enable is always beneficial. This attribute requests the compiler to form stall-enabled clusters if possible.
Intel Agilex and Stratix 10 Restriction: This attribute does not apply to designs that target Intel® Agilex™ or Intel® Stratix® 10 devices unless you specify the --hyper-optimized-handshaking=off option of the Intel® HLS Compiler i++ command.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/stall_enable
13.10. Intel HLS Compiler Pro Edition Component Default Interfaces
Interface | Description |
---|---|
Component invocation interface (component call and return) | The component call is implemented as an interface consisting of the component start and busy conduits. The component return is also implemented as an interface that includes the component done and stall signals. |
Scalar parameter interface (passed by value) | Scalar parameters are implemented as input conduits that are synchronized with the component invocation interface. |
Pointer parameter interface (passed by reference) | Pointer parameters are implemented as an implicit Avalon Memory-Mapped Master (mm_master) interface with the default parametrization. By default, the base address is treated as a scalar parameter so it is implemented as a conduit that is synchronized to the component invocation interface. A memory mapped interface is also exposed on the component. |
13.11. Intel HLS Compiler Pro Edition Component Invocation Interface Control Attributes
Control Attribute | Description |
---|---|
hls_avalon_streaming_component | This is the default component invocation interface. The component uses start, busy, stall, and done signals for handshaking. |
hls_avalon_slave_component | The start, done, and returndata (if applicable) signals appear in the component CSR instead of as conduits outside of the signal. |
hls_always_run_component | The start signal is tied to 1 internally in the component. There is no done signal output. |
hls_stall_free_return | If the downstream component never stalls, the stall signal is removed by internally setting it to 0. |
hls_avalon_streaming_component Invocation Control Attribute
- Description
-
This is the default component invocation interface.
This attribute follows the Avalon® ST protocol for both the function call and the return streams. The component consumes the unstable arguments when the start signal is asserted and the busy signal is deasserted. The component produces the return data when the done signal is asserted.
- Top-Level Module Ports
-
- Function call:
- start
- busy
- Function return:
- done
- stall
- Example
-
component hls_avalon_streaming_component void foo(/*component parameters*/)
hls_avalon_slave_component Invocation Control Attribute
- Description
- The start,
done, and
returndata
(if applicable) signals are registered in the
component slave memory map. Because the signals are
registered in the memory map, the start/busy and stall/done handshaking
signals are also removed. The removal of these
handshaking signals also removes the handshaking
signal for the input parameters.
For signals to be synchronized properly, each of the component parameters must be one of the following parameter types:
- Slave register argument ( hls_avalon_slave_register_argument ), so that the signals are in the register map. This includes Avalon® MM master or pointer interfaces that have the hls_avalon_slave_register_argument parameter applied.
- Slave memory argument ( hls_avalon_slave_memory_argument ) so that a dedicated Avalon® MM slave interface for handshaking is created.
- Stable argument ( hls_stable_argument ), to explicitly indicate that the signals do not require handshaking. This includes Avalon® MM master and pointer interfaces that have the hls_stable_argument parameter applied.
- Streaming interface arguments, so that a dedicated Avalon® ST interface for handshaking is created.
If you do not specify one of these component parameters, the compiler generates an error message when you compile this component.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/interfaces/mm_slaves.
- Top-Level Module Ports
-
- Avalon MM slave interface
- irq_done signal
- Example
-
component hls_avalon_slave_component void foo(/*component parameters*/)
hls_always_run_component Invocation Control Attribute
- Description
- The start signal is tied to 1 internally in the
component. There is no done signal output. The control logic
is optimized away when
Intel®
Quartus® Prime compiles the generated RTL
for your FPGA.
Use this protocol when the component data path relies only on explicit streams for data input and output.
IP verification does not support components with this component invocation protocol.
- Top-Level Module Ports
- None
- Example
-
component hls_always_run_component void foo(/*component parameters*/)
hls_stall_free_return Invocation Control Attribute
- Description
- If the downstream component never
stalls, the stall
signal is removed by internally setting it to
0.
You can use this attribute with the hls_avalon_streaming_component, hls_avalon_slave_component, and hls_always_run_component attributes to specify that the downstream component is stall free.
- Top-Level Module Ports
- N/A
- Example
-
component hls_stall_free_return int dut(int a, int b) { return a * b;}
13.12. Intel HLS Compiler Pro Edition Component Macros
Macro | Description |
---|---|
hls_avalon_slave_register_argument | Implement the parameter as a register that can be read from and written to over an Avalon® memory-mapped (MM) slave interface. |
hls_avalon_slave_memory_argument | Implement the parameter, in on-chip memory blocks, which can be read from or written to over a dedicated slave interface. |
hls_conduit_argument | Implement the parameter as an input conduit that is synchronous to the component call (start and busy). |
hls_readwrite_mode | Indicate to the compiler how the slave memory interface is accessed by external Avalon® memory-mapped (MM) masters. |
hls_stable_argument | A stable parameter is a parameter that does not change while there is live data in the component (that is, the argument does not change between pipelined function invocations). |
hls_avalon_slave_register_argument Component Macro
- Syntax
- hls_avalon_slave_register_argument
- Description
- The compiler implements the
parameter
as a register that can be read from and written to over an Avalon MM
slave interface. The
parameter
will be read into the component pipeline, similar to the conduit
implementation. The implementation is synchronous to the start and busy
interface.
Changes to the value of this parameter made by the component data path will not be reflected on this register.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/interfaces/mm_slaves.
- Example
-
component void foo( hls_avalon_slave_register_argument int b)
hls_avalon_slave_memory_argument Component Macro
- Syntax
- hls_avalon_slave_memory_argument(N)
- Description
- The compiler implements the
parameter,
where N specifies the size of the
memory in bytes, in on-chip memory blocks, which can be read from or
written to over a dedicated slave interface. The generated memory has
the same architectural optimizations as all other internal component
memories (such as banking or coalescing).
If the compiler performs static coalescing optimizations, the slave interface data width is the coalesced width. This attribute applies only to a pointer parameter.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/interfaces/mm_slaves.
- Example
-
component void foo( hls_avalon_slave_memory_argument(128*sizeof(int)) int *a)
hls_conduit_argument Component Macro
- Syntax
- hls_conduit_argument
- Description
-
This is the default interface for scalar parameters.
The compiler implements the parameter as an input conduit that is synchronous to the component call (start and busy).
- Example
-
component void foo(hls_conduit_argument int b)
hls_readwrite_mode Component Macro
- Syntax
- hls_readwrite_mode("type")
- Description
- This macro applies only to slave
memory interfaces.
Indicates to the compiler how the slave memory interface is accessed by external memory masters. This information can help the compiler build a more efficient memory system and might save FPGA area for your component.
The type parameter can take any one of the following values:
- readonly
Indicates that the external Avalon® memory-mapped (MM) master interface only ever reads from the slave memory.
- writeonly
Indicates that the external Avalon® MM master interface only ever writes to the slave memory.
- Example
-
component void foo(hls_avalon_slave_memory_argument(128*sizeof(int)) hls_readwrite_mode("writeonly") int *A)
hls_stable_argument Component Macro
- Syntax
- hls_stable_argument
- Description
- A stable parameter is a parameter that does not change while there is live data in the component (that is, the component argument does not change between pipelined function invocations).
Changing a stable parameter during component execution results in undefined behavior; each use of the stable parameter might be the old value or the new value, but with no guarantee of consistency. The same variable in the same invocation can appear with multiple values.
Using stable parameters, where appropriate, might save a significant number of registers in a design.
Stable parameters can be used with conduits, Avalon® MM master interfaces, and slave registers.
To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/interfaces/stable_arguments.
- Example
-
component int dut( hls_stable_argument int a, hls_stable_argument int b) { return a * b;}
13.13. Intel HLS Compiler Pro Edition Systems of Tasks API
Function | Description |
---|---|
ihc::launch | Marks a function as an Intel® HLS Compiler task for hardware generation, and launches the task function asynchronously. |
ihc::collect | Synchronizes the completion of the specified task function in the component. |
ihc::stream | Allows streaming communication between different task functions. |
ihc::launch_always_run | Launches a task function at component power-on or reset and continuously executes the function. Recommendation: Use the ihc_hls_set_component_wait_cycle API function with this function to keep your component and always-run task functions correctly coordinated. |
ihc::launch Function
- Syntax
-
ihc::launch<function[, capacity]>(function_argument_list)
Where the function parameters are defined as follows:
- function
The name of the function that you are calling as an Intel® HLS Compiler task in your component.
- capacity
An optional value that, when set, results in a FIFO buffer of depth capacity inserted between the function that launches the task and the task function.
Set the capacity parameter when you observe stall patterns that indicate an imbalance between any backpressure introduced by the called task function (function) and how often the caller launches this task function.
- function_argument_list
The list of arguments to pass to the task function.
This list must match the arguments (in names and types) that the task function expects.
- Description
- The ihc::launch API function identifies a function as an Intel® HLS Compiler task for hardware generation. Calling this function starts the task function asynchronously.
If the task function cannot accept a new thread, the ihc::launch function can block the function that calls the ihc::launch function.
The list of arguments that supply the ihc::launch API function must match (in names and types) the list of arguments expected by the task function.
ihc::collect Function
- Syntax
-
ihc::collect<function[, capacity]>()
Where the function parameters are defined as follows:
- function
The name of the Intel® HLS Compiler task function whose completion you want to synchronize.
- capacity
An optional value that, when set, results in a FIFO buffer of size capacity inserted between the task function and the function that collects the task.
Set the capacity parameter when you observe stall patterns that indicate that the task function (function) produces data at a different cadence from the reading cadence of the caller.
- Description
- The ihc::collect API function synchronizes
the completion of the specified task function in the
component.
For a non-void task function, the ihc::collect API function collects the result from the specified task function.
For a void task function, the ihc::collect API function synchronizes against the done signal of the task function.
The number of ihc::collect calls for a task function must match the number of ihc::launch calls for the same task function to flush all of the calls to the task.
Special Case: If you do not use ihc::collect at all, the compiler optimizes and ties-off the return stream of the task to be stall free and ignores any data on the return stream. Other streaming interfaces can still back-pressure the task function. Additionally, the caller might finish before the task function.
ihc::launch_always_run Function
- Syntax
-
ihc::launch_always_run<function>()
Where the function parameters are defined as follows:
- function
The name of the function that you are calling as a continuously-executing Intel® HLS Compiler task in your component.
- Description
-
Use the ihc::launch_always_run API function to continuously execute a task function, much like invoking a component with the hls_always_run_component invocation control attribute.
The task launches at the power-on or the reset of the component instead of at a specific point in the datapath.
The task function that you provide to this API must match this prototype:
void function(void)
Your task function must have no function arguments and no return value. You should communicate with your task function through global streams or by using compile-time constant template parameters.
Use the ihc_hls_set_component_wait_cycle API function when using the ihc::launch_always_run API function because the top level component can finish before all of the always-run task functions are done processing the work allotted to them.
- Example
- The following example shows a
simple use of the ihc::launch_always_run
function.
ihc::stream<int> in_stream, out_stream;

template <ihc::stream<int> &inStream, ihc::stream<int> &outStream>
void my_task() {
  int x = inStream.read();
  x *= 2;
  outStream.write(x);
}

component void foo() {
  ihc::launch_always_run<my_task<in_stream, out_stream>>();
}
Intel® HLS Compiler System of Tasks Code Example
int mul(int a, int b) {
  return a * b;
}

template <typename T>
T add(T a, T b) {
  return a + b;
}

component int foo(int a, int b) {
  ihc::launch<mul>(a, b);
  ihc::launch<add<int>>(a, b);
  int prod = ihc::collect<mul>();
  int sum = ihc::collect<add<int>>();
  return sum + prod;
}
13.13.1. ihc::stream Class
Template Object or Parameter | Description |
---|---|
ihc::stream | Streaming interface to the component or task function. |
ihc::buffer | Specifies the capacity (in words) of the FIFO buffer on the input data that associates with the stream. |
ihc::usesPackets | Exposes the startofpacket and endofpacket sideband signals on the stream interface. |
ihc::stream Template Object
- Syntax
- ihc::stream<datatype, template arguments >
- Valid Values
- Any trivially copyable C++ data type.
- Default Value
- N/A
- Description
- Streaming interface to the
component or task.
The width of the stream data bus is equal to a width of sizeof(datatype).
ihc::buffer Template Parameter
- Syntax
- ihc::buffer<value>
- Valid Values
- Non-negative integer value.
- Default Value
- 0
- Description
- The capacity, in words, of the FIFO buffer on the input data that associates with the stream.
ihc::usesPackets Template Parameter
- Syntax
- ihc::usesPackets<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Exposes the startofpacket and endofpacket sideband signals on the stream interface, which can be accessed by the packet based reads/writes.
Intel® HLS Compiler System of Tasks Streaming Interface stream Function APIs
Function API | Description |
---|---|
T read() | Blocking read call to be used from within the component or task |
T read(bool& sop, bool& eop) | Available only if usesPackets<true> is set. Blocking read with out-of-band startofpacket and endofpacket signals. |
T tryRead(bool &success) | Non-blocking read call to be used from within the component or task. The success bool is set to true if the read was valid. |
T tryRead(bool& success, bool& sop, bool& eop) | Available only if usesPackets<true> is set. Non-blocking read with out-of-band startofpacket and endofpacket signals. |
void write(T data) | Blocking write call from the component or task. |
void write(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Blocking write with out-of-band startofpacket and endofpacket signals. |
bool tryWrite(T data) | Non-blocking write call from the component or task. The return value represents whether the write was successful. |
bool tryWrite(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Non-blocking write with out-of-band startofpacket and endofpacket signals. The return value represents whether the write was successful. |
13.14. Intel HLS Compiler Pro Edition Pipes API
template <class name, class dataT, size_t min_capacity = 0>
class pipe {
public:
  // Blocking
  static dataT read();
  static void write(dataT data);
  // Non-blocking
  static dataT read(bool &success);
  static void write(dataT data, bool &success);
};
Parameter | Description |
---|---|
name | The type that is the basis of a pipe identification. It is typically a user-defined class, in a user namespace. Forward declaration of the type is enough, and the type need not be defined. |
dataT | The data type of the packet contained within a pipe. This is the data type that is read during a successful pipe read() operation, or written during a successful pipe write() operation. The type must have a standard layout and be trivially copyable. |
min_capacity | The minimum number of words (in units of dataT) that the pipe must be able to store without any being read out. The compiler might create a pipe with a larger capacity due to performance considerations. |
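To illustrate how the name parameter identifies a pipe, the following sketch uses a host-only stand-in class that mirrors the non-blocking part of the pipe signature. The class pipe_model, the tag types PipeA and PipeB, and the helper functions are all hypothetical. This is not the Intel implementation, and it models only FIFO ordering with no capacity limit.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical software model of the pipe interface (non-blocking calls only).
template <class name, class dataT, std::size_t min_capacity = 0>
class pipe_model {
public:
  // Non-blocking write: this unbounded model always succeeds.
  static void write(dataT data, bool &success) {
    fifo().push_back(data);
    success = true;
  }
  // Non-blocking read: success is false when the pipe is empty.
  static dataT read(bool &success) {
    success = !fifo().empty();
    if (!success) return dataT();
    dataT front = fifo().front();
    fifo().pop_front();
    return front;
  }
private:
  // One queue exists per distinct template instantiation.
  static std::deque<dataT> &fifo() {
    static std::deque<dataT> storage;
    return storage;
  }
};

class PipeA;  // A forward declaration is enough to name a pipe
class PipeB;
using pipe_a = pipe_model<PipeA, int, 4>;
using pipe_b = pipe_model<PipeB, int, 4>;

// Writes two values to pipe_a, then reads them back in FIFO order.
int roundtrip_sum(int x, int y) {
  bool ok = false;
  pipe_a::write(x, ok);
  pipe_a::write(y, ok);
  int first = pipe_a::read(ok);
  int second = pipe_a::read(ok);
  return first + second;
}

// pipe_b has a different name type, so writes to pipe_a never reach it.
bool pipe_b_is_empty() {
  bool ok = true;
  pipe_b::read(ok);
  return !ok;
}
```

Because every member is static, two pipes with the same dataT are distinguished only by their name type, which is why a forward-declared class is sufficient.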
13.15. Intel HLS Compiler Pro Edition Streaming Input Interfaces
Use the stream_in object and template arguments to explicitly declare Avalon® Streaming (ST) input interfaces. You can also use the stream_in Function APIs.
Template Object or Parameter | Description |
---|---|
ihc::stream_in | Streaming input interface to the component. |
ihc::buffer | Specifies the capacity (in words) of the FIFO buffer on the input data that associates with the stream. |
ihc::readyLatency | Specifies the number of cycles between when the ready signal is deasserted and when the input stream can no longer accept new inputs. |
ihc::bitsPerSymbol | Describes how the data is broken into symbols on the data bus. |
ihc::firstSymbolInHighOrderBits | Specifies whether the data symbols in the stream are in big endian order. |
ihc::usesPackets | Exposes the startofpacket and endofpacket sideband signals on the stream interface. |
ihc::usesEmpty | Exposes the empty out-of-band signal on the stream interface. |
ihc::usesValid | Controls whether a valid signal is present on the stream interface. |
ihc::stream_in Template Object
- Syntax
- ihc::stream_in<datatype, template parameters >
- Valid Values
- Any valid C++ datatype
- Default Value
- N/A
- Description
- Streaming input interface to the
component.
The width of the stream data bus is equal to a width of sizeof(datatype).
The testbench must populate this buffer (stream) fully before the component can start to read from the buffer.
To learn more, review the following tutorials:
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_buffer
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_packets_empty
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_packet_ready_valid
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_ready_latency
- <quartus_installdir>/hls/examples/tutorials/interfaces/multiple_stream_call_sites
ihc::buffer Template Parameter
- Syntax
- ihc::buffer<value>
- Valid Values
- Non-negative integer value.
- Default Value
- 0
- Description
- The capacity, in words, of the FIFO buffer on the input data that associates with the stream. The buffer has latency. It immediately consumes data, but this data is not immediately available to the logic in the component.
If you use the tryRead() function to access this stream and the stream read is scheduled within the first cycles of operation, the first call (or the first several calls) to the tryRead() function might return false in simulation (and therefore in hardware).
Review the function viewer in the Graph Viewer of the High Level Design Reports to see when operations are scheduled in your component. If you see this behavior, use the blocking read() function to ensure consistency between emulation and simulation.
This parameter is available only on input streams.
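The following sketch illustrates why early tryRead() calls can fail on a buffered stream. It is a plain C++ toy model, not the ihc::stream_in implementation: the BufferedStreamModel class, its push() and tick() helpers, and the latency values are hypothetical and chosen only for illustration. The actual buffer latency depends on your design.

```cpp
#include <cassert>
#include <deque>

// Toy cycle-stepped model of a buffered input stream (hypothetical names).
class BufferedStreamModel {
public:
    explicit BufferedStreamModel(int latency) : latency_(latency) {
        assert(latency >= 0);
    }

    // Upstream pushes a word; the buffer accepts it immediately, but the word
    // becomes visible to the reader only after `latency_` cycles.
    void push(int word) { inflight_.push_back({word, cycle_ + latency_}); }

    // Advance simulated time by one clock cycle.
    void tick() { ++cycle_; }

    // Non-blocking read: succeeds only if the oldest word has finished
    // travelling through the buffer.
    int tryRead(bool &success) {
        if (!inflight_.empty() && inflight_.front().readyAt <= cycle_) {
            int word = inflight_.front().word;
            inflight_.pop_front();
            success = true;
            return word;
        }
        success = false;
        return 0;
    }

private:
    struct Entry {
        int word;
        long readyAt;
    };
    std::deque<Entry> inflight_;
    long cycle_ = 0;
    int latency_;
};

// Counts how many tryRead() calls fail before the first success when the
// reader polls once per cycle, starting in the cycle the word is pushed.
inline int failedReadsBeforeFirstSuccess(int latency) {
    BufferedStreamModel stream(latency);
    stream.push(42);
    int failures = 0;
    bool ok = false;
    while (true) {
        stream.tryRead(ok);
        if (ok) break;
        ++failures;
        stream.tick();
    }
    return failures;
}
```

In this model, a buffer latency of 3 cycles causes three failed polls before the first successful read, whereas a blocking read would simply wait for the data.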
ihc::readyLatency Template Parameter
- Syntax
- ihc::readyLatency<value>
- Valid Values
- Integer value in the range 0 – 8
- Default Value
- 0
- Description
- The number of cycles between when the ready signal is deasserted and when the input stream can no longer accept new inputs.
ihc::bitsPerSymbol Template Parameter
- Syntax
- ihc::bitsPerSymbol<value>
- Valid Values
- A positive integer value that evenly divides the data type size.
- Default Value
- Datatype size
- Description
- Describes how the data is broken into symbols on the data bus.
Data is broken down according to how you set the ihc::firstSymbolInHighOrderBits declaration. By default, data is broken down in little endian order.
ihc::firstSymbolInHighOrderBits Template Parameter
- Syntax
- ihc::firstSymbolInHighOrderBits<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Specifies whether the data symbols in the stream are in big endian order.
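As an illustration of how ihc::bitsPerSymbol and ihc::firstSymbolInHighOrderBits interact, the following plain C++ sketch splits a data word into symbols in either order. The splitIntoSymbols function is a hypothetical helper, not part of the HLS headers, and it assumes that "first symbol in high-order bits" means symbol 0 occupies the most significant bits of the data bus.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Splits the low `dataBits` bits of `word` into symbols of `bitsPerSymbol`
// bits each. Element 0 of the result is the first symbol of the stream.
// Hypothetical helper for illustration only.
inline std::vector<std::uint32_t> splitIntoSymbols(std::uint32_t word,
                                                   unsigned dataBits,
                                                   unsigned bitsPerSymbol,
                                                   bool firstSymbolInHighOrderBits) {
    assert(dataBits > 0 && dataBits <= 32);
    // bitsPerSymbol must evenly divide the data type size.
    assert(bitsPerSymbol > 0 && dataBits % bitsPerSymbol == 0);

    const unsigned count = dataBits / bitsPerSymbol;
    const std::uint64_t mask = (std::uint64_t{1} << bitsPerSymbol) - 1;

    std::vector<std::uint32_t> symbols(count);
    for (unsigned i = 0; i < count; ++i) {
        // Little endian order (the default) places symbol 0 in the low bits;
        // big endian order places symbol 0 in the high bits.
        const unsigned slot = firstSymbolInHighOrderBits ? (count - 1 - i) : i;
        symbols[i] = static_cast<std::uint32_t>((word >> (slot * bitsPerSymbol)) & mask);
    }
    return symbols;
}
```

For the 32-bit word 0x11223344 with 8-bit symbols, this sketch yields 0x44 as the first symbol in the default order and 0x11 as the first symbol when the high-order option is set.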
ihc::usesPackets Template Parameter
- Syntax
- ihc::usesPackets<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Exposes the startofpacket and endofpacket sideband signals on the stream interface, which can be accessed by the packet-based reads and writes.
ihc::usesEmpty Template Parameter
- Syntax
- ihc::usesEmpty<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Exposes the empty out-of-band signal on the stream interface.
Use this declaration only with streams that read more than one data symbol per clock cycle.
The empty signal indicates the number of symbols on the data bus that do not represent valid data during the final stream read of a packet.
You can control whether the empty symbols are in the low-order bits or high-order bits with the ihc::firstSymbolInHighOrderBits declaration.
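The arithmetic behind the empty signal can be sketched in plain C++. The helper names below (symbolsPerBeat, beatsInPacket, emptyOnFinalBeat) are hypothetical and are not part of the HLS headers; they only show how the number of unused symbols on the final read of a packet follows from the packet length and the number of symbols per clock cycle.

```cpp
#include <cassert>

// Number of symbols carried on the data bus in one clock cycle.
inline int symbolsPerBeat(int dataBits, int bitsPerSymbol) {
    assert(bitsPerSymbol > 0 && dataBits % bitsPerSymbol == 0);
    return dataBits / bitsPerSymbol;
}

// Number of stream reads needed to transfer a packet of `packetSymbols` symbols.
inline int beatsInPacket(int packetSymbols, int perBeat) {
    assert(packetSymbols > 0 && perBeat > 0);
    return (packetSymbols + perBeat - 1) / perBeat;
}

// Value of `empty` on the final read of the packet: the count of symbols on
// the data bus that do not carry valid data.
inline int emptyOnFinalBeat(int packetSymbols, int perBeat) {
    assert(packetSymbols > 0 && perBeat > 0);
    const int used = packetSymbols % perBeat;
    return used == 0 ? 0 : perBeat - used;
}
```

For example, a 10-symbol packet on a 32-bit stream with 8-bit symbols takes three reads, and the final read carries two valid symbols and two empty ones.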
ihc::usesValid Template Parameter
- Syntax
- ihc::usesValid<value>
- Valid Values
- true or false
- Default Value
- true
- Description
- Controls whether a valid signal is present on the stream interface. If false, the upstream source must provide valid data on every cycle that ready is asserted.
This is equivalent to changing the stream read calls to tryRead and assuming that success is always true.
If set to false, buffer and readyLatency must be 0.
Intel® HLS Compiler Pro Edition Streaming Input Interface stream_in Function APIs
Function API | Description |
---|---|
T read() | Blocking read call to be used from within the component. |
T read(bool& sop, bool& eop) | Available only if usesPackets<true> is set. Blocking read with out-of-band startofpacket and endofpacket signals. |
T read(bool& sop, bool& eop, int& empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking read with out-of-band startofpacket, endofpacket, and empty signals. |
T tryRead(bool& success) | Non-blocking read call to be used from within the component. The success bool is set to true if the read was valid. That is, the Avalon®-ST valid signal was high when the component tried to read from the stream. The emulation model of tryRead() is not cycle-accurate, so the behavior of tryRead() might differ between emulation and simulation. |
T tryRead(bool& success, bool& sop, bool& eop) | Available only if usesPackets<true> is set. Non-blocking read with out-of-band startofpacket and endofpacket signals. |
T tryRead(bool& success, bool& sop, bool& eop, int& empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Non-blocking read with out-of-band startofpacket, endofpacket, and empty signals. |
void write(T data) | Blocking write call to be used from the testbench to populate the FIFO to be sent to the component. |
void write(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Blocking write call with out-of-band startofpacket and endofpacket signals. |
void write(T data, bool sop, bool eop, int empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking write call with out-of-band startofpacket, endofpacket, and empty signals. |
Intel® HLS Compiler Streaming Input Interfaces Code Example
    #include "HLS/hls.h"

    // Blocking read
    void foo (ihc::stream_in<int> &a) {
      int x = a.read();
    }

    // Non-blocking read
    void foo_nb (ihc::stream_in<int> &a) {
      bool success = false;
      int x = a.tryRead(success);
      if (success) {
        // x is valid
      }
    }

    int main() {
      // The testbench fills both streams before calling the functions.
      ihc::stream_in<int> a;
      ihc::stream_in<int> b;
      for (int i = 0; i < 10; i++) {
        a.write(i);
        b.write(i);
      }
      foo(a);
      foo_nb(b);
    }
13.16. Intel HLS Compiler Pro Edition Streaming Output Interfaces
Use the stream_out object and template arguments to explicitly declare Avalon® Streaming (ST) output interfaces. You can also use the stream_out Function APIs.
Template Object or Parameter | Description |
---|---|
ihc::stream_out | Streaming output interface from the component. |
ihc::readyLatency | Specifies the number of cycles between when the ready signal is deasserted and when the sink can no longer accept new inputs. |
ihc::bitsPerSymbol | Describes how the data is broken into symbols on the data bus. |
ihc::firstSymbolInHighOrderBits | Specifies whether the data symbols in the stream are in big endian order. |
ihc::usesPackets | Exposes the startofpacket and endofpacket sideband signals on the stream interface. |
ihc::usesEmpty | Exposes the empty out-of-band signal on the stream interface. |
ihc::usesReady | Controls whether a ready signal is present. |
ihc::stream_out Template Object
- Syntax
- ihc::stream_out<datatype, template parameters >
- Valid Values
- Any valid POD (plain old data) C++ datatype.
- Default Value
- N/A
- Description
- Streaming output interface from the component. The testbench can read from this buffer once the component returns.
To learn more, review the following tutorials:
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_buffer
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_packets_empty
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_packet_ready_valid
- <quartus_installdir>/hls/examples/tutorials/interfaces/explicit_streams_ready_latency
- <quartus_installdir>/hls/examples/tutorials/interfaces/multiple_stream_call_sites
ihc::readyLatency Template Parameter
- Syntax
- ihc::readyLatency<value>
- Valid Values
- Integer value in the range 0 – 8
- Default Value
- 0
- Description
- The number of cycles between when the ready signal is deasserted and when the sink can no longer accept new inputs.
Conceptually, you can view this parameter as an "almost ready" latency on the input FIFO buffer for the data that associates with the stream.
ihc::bitsPerSymbol Template Parameter
- Syntax
- ihc::bitsPerSymbol<value>
- Valid Values
- Positive integer value that evenly divides the data type size.
- Default Value
- Datatype size
- Description
- Describes how the data is broken into symbols on the data bus.
Data is broken down according to how you set the ihc::firstSymbolInHighOrderBits declaration. By default, data is broken down in little endian order.
ihc::firstSymbolInHighOrderBits Template Parameter
- Syntax
- ihc::firstSymbolInHighOrderBits<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Specifies whether the data symbols in the stream are in big endian order.
ihc::usesPackets Template Parameter
- Syntax
- ihc::usesPackets<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Exposes the startofpacket and endofpacket sideband signals on the stream interface, which can be accessed by the packet-based reads and writes.
ihc::usesEmpty Template Parameter
- Syntax
- ihc::usesEmpty<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Exposes the empty out-of-band signal on the stream interface.
Use this declaration only with streams that write more than one data symbol per clock cycle.
The empty signal indicates the number of symbols on the data bus that do not represent valid data during the final stream write of a packet.
You can control whether the empty symbols are in the low-order bits or high-order bits with the ihc::firstSymbolInHighOrderBits declaration.
ihc::usesReady Template Parameter
- Syntax
- ihc::usesReady<value>
- Valid Values
- true or false
- Default Value
- true
- Description
- Controls whether a ready signal is present. If false, the downstream sink must be able to accept data on every cycle that valid is asserted. This is equivalent to changing the stream write calls to tryWrite and assuming that success is always true.
If set to false, readyLatency must be 0.
Intel® HLS Compiler Pro Edition Streaming Output Interface stream_out Function APIs
Function API | Description |
---|---|
void write(T data) | Blocking write call from the component. |
void write(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Blocking write with out-of-band startofpacket and endofpacket signals. |
void write(T data, bool sop, bool eop, int empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking write with out-of-band startofpacket, endofpacket, and empty signals. |
bool tryWrite(T data) | Non-blocking write call from the component. The return value represents whether the write was successful. |
bool tryWrite(T data, bool sop, bool eop) | Available only if usesPackets<true> is set. Non-blocking write with out-of-band startofpacket and endofpacket signals. The return value represents whether the write was successful. That is, the downstream interface was pulling the ready signal high while the HLS component tried to write to the stream. |
bool tryWrite(T data, bool sop, bool eop, int empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Non-blocking write with out-of-band startofpacket, endofpacket, and empty signals. The return value represents whether the write was successful. |
T read() | Blocking read call to be used from the testbench to read back the data from the component. |
T read(bool& sop, bool& eop) | Available only if usesPackets<true> is set. Blocking read call to be used from the testbench to read back the data from the component with out-of-band startofpacket and endofpacket signals. |
T read(bool& sop, bool& eop, int& empty) | Available only if usesPackets<true> and usesEmpty<true> are set. Blocking read call to be used from the testbench to read back the data from the component with out-of-band startofpacket, endofpacket, and empty signals. |
Intel® HLS Compiler Streaming Output Interfaces Code Example
    #include "HLS/hls.h"

    // Blocking write
    void foo (ihc::stream_out<int> &a) {
      static int count = 0;
      for (int idx = 0; idx < 5; idx++) {
        a.write(count++); // Blocking write
      }
    }

    // Non-blocking write
    void foo_nb (ihc::stream_out<int> &a) {
      static int count = 0;
      for (int idx = 0; idx < 5; idx++) {
        bool success = a.tryWrite(count++); // Non-blocking write
        if (success) {
          // write was successful
        }
      }
    }

    int main() {
      ihc::stream_out<int> a;
      foo(a); // or foo_nb(a);

      // copy output to an array
      int outputData[5];
      for (int i = 0; i < 5; i++) {
        outputData[i] = a.read();
      }
    }
13.17. Intel HLS Compiler Pro Edition Memory-Mapped Interfaces
Use the mm_master object and template arguments to explicitly declare Avalon® Memory-Mapped (MM) Master interfaces for your component.
Template Object or Parameter | Description |
---|---|
ihc::mm_master | The underlying pointer type. |
ihc::dwidth | The width of the memory-mapped data bus in bits |
ihc::awidth | The width of the memory-mapped address bus in bits. |
ihc::aspace | The address space of the interface that associates with the master. |
ihc::latency | The guaranteed latency from when a read command exits the component to when the external memory returns valid read data. |
ihc::maxburst | The maximum number of data transfers that can associate with a read or write transaction. |
ihc::align | The alignment of the base pointer address in bytes. |
ihc::readwrite_mode | The port direction of the interface. |
ihc::waitrequest | Adds the waitrequest signal that is asserted by the slave when it is unable to respond to a read or write request. |
getInterfaceAtIndex | This testbench function is used to index into an mm_master object. |
ihc::mm_master Template Object
- Syntax
- ihc::mm_master<datatype, template parameter >
- Valid Values
- Any valid C++ datatype
- Default Value
- Default interface for pointer arguments.
- Description
- The underlying pointer type. Pointer arithmetic performed on the master object conforms to this type. Dereferencing the master results in a load-store site with a width of sizeof(datatype). The default alignment is the size of the datatype.
You can use multiple template arguments in any combination as long as the combination of arguments describes a valid hardware configuration.
Example:
component int dut( ihc::mm_master<int, ihc::aspace<2>, ihc::latency<3>, ihc::awidth<10>, ihc::dwidth<32> > &a)
To learn more, review the following tutorials:
- <quartus_installdir>/hls/examples/tutorials/interfaces/pointer_mm_master
- <quartus_installdir>/hls/examples/tutorials/interfaces/mm_master_testbench_operators
ihc::dwidth Template Parameter
- Syntax
- ihc::dwidth<value>
- Valid Values
- 8, 16, 32, 64, 128, 256, 512, or 1024
- Default Value
- 64
- Description
- The width of the memory-mapped data bus in bits.
ihc::awidth Template Parameter
- Syntax
- ihc::awidth<value>
- Valid Values
- Integer value in the range 1 – 64
- Default Value
- 64
- Description
- The width of the memory-mapped address bus in bits.
This value affects only the width of the Avalon® MM Master interface. The size of the conduit of the base address pointer is always set to 64 bits.
ihc::aspace Template Parameter
- Syntax
- ihc::aspace<value>
- Valid Values
- Integer value greater than 0.
- Default Value
- 1
- Description
- The address space of the interface that associates with the master. Each unique value results in a separate Avalon MM Master interface on your component. All masters with the same address space are arbitrated within the component to a single interface. As such, these masters must share the same template parameters that describe the interface.
ihc::latency Template Parameter
- Syntax
- ihc::latency<value>
- Valid Values
- Non-negative integer value
- Default Value
- 1
- Description
- The guaranteed latency from when a read command exits the component to when the external memory returns valid read data. If this latency is variable (such as when accessing DRAM), set it to 0.
ihc::maxburst Template Parameter
- Syntax
- ihc::maxburst<value>
- Valid Values
- Integer value in the range 1 – 1024
- Default Value
- 1
- Description
- The maximum number of data transfers that can associate with a read or write transaction. This value controls the width of the burstcount signal.
For fixed latency interfaces, this value must be set to 1.
For more details, review information about burst signals and the burstcount signal role in "Avalon Memory-Mapped Interface Signal Roles" in Avalon Interface Specifications.
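The relationship between ihc::maxburst and the burstcount signal width can be sketched as follows. This plain C++ helper is hypothetical and not part of the HLS headers. It assumes the Avalon convention that a burstcount signal of width n can encode a maximum burst of 2^(n-1) transfers, and it rounds up for values that are not powers of two, which might not match what the compiler generates.

```cpp
#include <cassert>

// Smallest burstcount width (in bits) that can encode `maxburst` transfers,
// assuming an n-bit burstcount signal encodes bursts of up to 2^(n-1).
// Hypothetical helper for illustration only.
inline unsigned burstcountWidth(unsigned maxburst) {
    assert(maxburst >= 1 && maxburst <= 1024);
    unsigned width = 1;
    while ((1u << (width - 1)) < maxburst) {
        ++width;
    }
    return width;
}
```

Under this assumption, the default maxburst of 1 needs a 1-bit burstcount signal, a maxburst of 8 needs 4 bits, and the maximum value of 1024 needs 11 bits.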
ihc::align Template Parameter
- Syntax
- ihc::align<value>
- Valid Values
- Integer value greater than the alignment of the datatype
- Default Value
- Alignment of the datatype
- Description
- The alignment of the base pointer address in bytes.
The Intel® HLS Compiler uses this information to determine how many simultaneous loads and stores this pointer can permit.
For example, if you have a bus that carries four 32-bit integers, use ihc::dwidth<128> (bits) and ihc::align<16> (bytes). This means that up to 16 contiguous bytes (or four 32-bit integers) can be loaded or stored as a coalesced memory word per clock cycle.
Important: The caller is responsible for aligning the data to the set value for the align argument; otherwise, functional failures might occur.
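Because the caller is responsible for alignment, a testbench might check its buffers before invoking the component. The following plain C++ sketch shows one way to do that, together with the arithmetic from the 128-bit example above. The isAligned and elementsPerWord helpers are hypothetical names and are not part of the HLS headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `ptr` is aligned to `alignBytes` bytes.
inline bool isAligned(const void *ptr, std::size_t alignBytes) {
    assert(alignBytes > 0);
    return reinterpret_cast<std::uintptr_t>(ptr) % alignBytes == 0;
}

// Number of elements of `elemBytes` bytes that fit in one memory word on a
// data bus that is `dwidthBits` bits wide.
inline std::size_t elementsPerWord(std::size_t dwidthBits, std::size_t elemBytes) {
    assert(dwidthBits % 8 == 0 && elemBytes > 0);
    return (dwidthBits / 8) / elemBytes;
}
```

In a testbench, declaring the buffer with alignas(16) and asserting isAligned(buffer, 16) before the component call is one way to catch a misaligned base pointer early.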
ihc::readwrite_mode Template Parameter
- Syntax
- ihc::readwrite_mode<value>
- Valid Values
- readwrite, readonly, or writeonly
- Default Value
- readwrite
- Description
- The port direction of the interface. Only the relevant Avalon master signals are generated.
ihc::waitrequest Template Parameter
- Syntax
- ihc::waitrequest<value>
- Valid Values
- true or false
- Default Value
- false
- Description
- Adds the waitrequest signal that is asserted by the slave when it is unable to respond to a read or write request. For more information about the waitrequest signal, see "Avalon Memory-Mapped Interface Signal Roles" in Avalon Interface Specifications.
getInterfaceAtIndex Testbench Function
- Syntax
- getInterfaceAtIndex(int index)
- Description
- This testbench function is used to index into an mm_master object. It can be useful when iterating over an array and invoking a component on different indices of the array. This function is supported only in the testbench.
- Code Example
-
    int main() {
      // …….
      for (int idx = 0; idx < N; idx++) {
        dut(src_mm.getInterfaceAtIndex(idx));
      }
      // …….
    }
13.18. Intel HLS Compiler Pro Edition Load-Store Unit Control
For variable-latency Avalon® Memory-Mapped (MM) Master interfaces (ihc::latency<0>), you can control the type of load-store unit (LSU) with the ihc::lsu template object and the corresponding load() and store() functions.
Template Object/Parameter/Function | Description |
---|---|
ihc::lsu | The underlying LSU class template object |
ihc::style | Specifies the type of load-store unit. |
ihc::static_coalescing | Explicitly allows or prevents static coalescing of a load/store operation with other load/store operations. |
load | Loads data from memory into the LSU. |
store | Stores data from the LSU into memory. |
ihc::lsu Template Object
- Syntax
- ihc::lsu<template arguments >
- Valid Values
- N/A.
- Default Value
- N/A.
- Description
- The underlying LSU class object.
To learn more, review the following tutorial: <quartus_installdir>/hls/examples/tutorials/best_practices/lsu_control
ihc::style Template Parameter
- Syntax
- ihc::style<LSU_type >
- Valid Values
- LSU_type can be one of the following values:
- BURST_COALESCED
- PIPELINED
- Default Value
- BURST_COALESCED
- Description
- Specifies the type of load-store unit to create.
A burst-coalesced LSU buffers requests until the largest possible burst can be made.
A pipelined LSU submits requests as they are received.
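The difference between the two styles can be illustrated with a deliberately simplified model that counts memory transactions for a sequence of word addresses. This is a hypothetical sketch, not the compiler's LSU logic: real burst-coalesced LSUs also account for timing, alignment, and other constraints that the model ignores, and the maxburst argument here is only a stand-in for the largest burst the interface allows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pipelined style (toy model): every request becomes its own transaction.
inline std::size_t pipelinedTransactions(const std::vector<unsigned> &wordAddrs) {
    return wordAddrs.size();
}

// Burst-coalesced style (toy model): consecutive word addresses are merged
// into one burst until the burst reaches `maxburst` transfers.
inline std::size_t burstCoalescedTransactions(const std::vector<unsigned> &wordAddrs,
                                              unsigned maxburst) {
    assert(maxburst >= 1);
    std::size_t bursts = 0;
    unsigned currentLength = 0;  // transfers in the burst being built
    for (std::size_t i = 0; i < wordAddrs.size(); ++i) {
        const bool extendsBurst = currentLength > 0 && currentLength < maxburst &&
                                  wordAddrs[i] == wordAddrs[i - 1] + 1;
        if (extendsBurst) {
            ++currentLength;
        } else {
            ++bursts;  // start a new burst at this address
            currentLength = 1;
        }
    }
    return bursts;
}
```

For the address sequence 0, 1, 2, 3, 8, 9, 20 with a maximum burst of 4, the model issues seven transactions in the pipelined style and three in the burst-coalesced style.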
ihc::static_coalescing Template Parameter
- Syntax
- ihc::static_coalescing<value >
- Valid Values
- true or false
- Default Value
- true
- Description
- Specifies whether to allow or prevent static coalescing of the load/store operation with other load/store operations.
load Function
- Syntax
- load(<memory_location>)
- Parameters
- The <memory_location> argument specifies the memory location from which to load data into the LSU.
- Return Type
- Object of same type as the base type of the argument specified for <memory_location>.
- Description
- The load function loads data from a memory location specified by the <memory_location> argument and returns the data that the argument points to.
store Function
- Syntax
- store(<memory_location>, <value_to_store>)
- Parameters
- The <memory_location> argument specifies the memory location to store data coming from the LSU.
The <value_to_store> argument is the value from the LSU to store in memory. The type is the same as the pointer base type.
- Return Type
- None.
- Description
- The store function stores data in the LSU to a memory location specified by the <memory_location> argument.
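The load and store contract described above can be sketched with a stand-in type in plain C++. MockLsu is a hypothetical class that only mirrors the documented argument and return types; it is not the ihc::lsu implementation. The sketch also assumes that the functions are called as static members of an LSU type, which you should confirm against the lsu_control tutorial.

```cpp
#include <cassert>

// Hypothetical stand-in for an LSU type; not the ihc::lsu implementation.
struct MockLsu {
    // Returns the data that `location` points to. The return type matches the
    // base type of the pointer argument.
    template <typename T>
    static T load(const T *location) {
        assert(location != nullptr);
        return *location;
    }

    // Writes `value` to `location`. The value type matches the pointer base type.
    template <typename T>
    static void store(T *location, T value) {
        assert(location != nullptr);
        *location = value;
    }
};

// Example access pattern: read each element through the LSU stand-in, scale
// it, and write it back to the same location.
inline void scaleArray(int *data, int n, int factor) {
    for (int i = 0; i < n; ++i) {
        const int value = MockLsu::load(&data[i]);
        MockLsu::store(&data[i], value * factor);
    }
}
```

In a component, the stand-in would be replaced by an ihc::lsu type configured with the ihc::style and ihc::static_coalescing parameters described in this section.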
13.19. Intel HLS Compiler Pro Edition Arbitrary Precision Data Types
Data Type | Intel Header File | Description |
---|---|---|
ac_int | HLS/ac_int.h | Arbitrary-width integer support. To learn more, review the following tutorials: |
ac_fixed | HLS/ac_fixed.h | Arbitrary-precision fixed-point number support. To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_fixed_constructor |
ac_fixed | HLS/ac_fixed_math.h | Support for some nonstandard math functions for arbitrary-precision fixed-point data types. To learn more, review the tutorial: <quartus_installdir>/hls/examples/tutorials/ac_datatypes/ac_fixed_math_library |
ac_complex | HLS/ac_complex.h | Complex number support |
hls_float | HLS/hls_float.h | Arbitrary-precision floating-point number support |
hls_float | HLS/hls_float_math.h | Support for commonly used exponential, logarithmic, power, and trigonometric functions. To learn more, review the following tutorials: |