3.5.1. Generating an Architecture for Highest Performance

Intel® FPGA AI Suite: Compiler Reference Manual

ID 768972

Date 9/06/2023

To generate an architecture that is optimized for a graph or set of graphs, the Intel® FPGA AI Suite architecture optimizer uses a base architecture and modifies parameters to achieve the highest throughput in frames per second (fps). The best architecture is saved as an architecture description file with a file name based on the architecture parameters.

Some parameters (such as the precision specified in the .arch file) are not modified during optimization. Parameters that are not optimized take their values from the input architecture file specified by the required --march option. Important IP parameters to set manually include arch_precision, num_interleaved_features, and num_interleaved_filters. IP parameters are described in the Intel® FPGA AI Suite IP Reference Manual.

Architecture optimization is a slow process. The optimizer saves the current best architecture once every 30 seconds so that you can monitor progress. These intermediate architecture files have the following file name format: current_best_<number of fps>_<index>.arch
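
One way to monitor progress is to poll for the newest intermediate file. The following is a minimal sketch, not part of the tool: the helper name latest_best_arch is hypothetical, and the sketch assumes that the current_best_*.arch files are written to a directory that you pass as the first argument (the current directory by default).

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified intermediate
# architecture file in a directory.
# Assumption: the current_best_*.arch files are written to the directory
# given as $1 (default: the current directory).
latest_best_arch() {
    dir="${1:-.}"
    # ls -t sorts by modification time, newest first. Errors are silenced
    # so that nothing is printed before the first file is saved.
    ls -t "$dir"/current_best_*.arch 2>/dev/null | head -n 1
}

# Example: poll every 30 seconds while the optimizer runs in another shell.
# while true; do latest_best_arch .; sleep 30; done
```

The 30-second polling interval in the commented example matches the save interval described above; a shorter interval would only repeat the same file name.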

When you use the --mtarget-fps option, the architecture optimizer might use more than 100 GB of memory. If this memory is not available, the operating system might stop the dla_compiler process, and the shell prints the message "Killed". For information about how to use the COREDLA_TARGET_FPS_THREAD_LIMIT environment variable to control the resource consumption of the architecture optimizer, refer to the description of --mtarget-fps in Architecture Optimizer Options (dla_compiler Command Options).
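
Before starting a long run with --mtarget-fps, it can help to check how much memory the machine has available. The following is a sketch under stated assumptions: it is Linux-only, it reads the MemAvailable field of /proc/meminfo, and the helper name has_free_memory_gb is hypothetical. The 100 GB figure in the usage comment comes from the text above and is only a rough guide, because actual usage can exceed it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if at least $1 GiB of memory is available.
# $2 is the meminfo file to read (default: /proc/meminfo, Linux only).
has_free_memory_gb() {
    need_gb="$1"
    meminfo="${2:-/proc/meminfo}"
    # MemAvailable is reported in kB; 1048576 kB = 1 GiB.
    awk -v need="$need_gb" '
        /^MemAvailable:/ { found = 1; ok = ($2 / 1048576 >= need) }
        END { exit !(found && ok) }
    ' "$meminfo"
}

# Example: warn before launching the optimizer with --mtarget-fps.
# if ! has_free_memory_gb 100; then
#     echo "Warning: less than 100 GB available; dla_compiler might be killed." >&2
# fi
```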

If multiple input graphs are specified by using multiple --network-file options, the optimizer calculates a weighted objective. For example, if two graphs are specified with throughput values of fps₁ and fps₂, the optimizer maximizes the overall throughput, which is computed from the user-specified weights w₁ and w₂ as follows:

$\mathrm{fps}_{total} = \frac{1}{\frac{w_{1}}{\mathrm{fps}_{1}} + \frac{w_{2}}{\mathrm{fps}_{2}}}$
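
To get a feel for the weighted objective, you can evaluate the formula by hand. The following sketch generalizes it to any number of weight and throughput pairs. The helper name weighted_fps and the sample throughput values are illustrative assumptions, not output from the performance estimator.

```shell
#!/bin/sh
# Hypothetical helper: evaluate fps_total = 1 / (w1/fps1 + w2/fps2 + ...).
# Usage: weighted_fps w1 fps1 [w2 fps2 ...]
weighted_fps() {
    awk 'BEGIN {
        sum = 0
        # ARGV[1], ARGV[2] hold the first (weight, fps) pair, and so on.
        for (i = 1; i + 1 < ARGC; i += 2)
            sum += ARGV[i] / ARGV[i + 1]
        printf "%.2f\n", 1 / sum
    }' "$@"
}

# Example with made-up throughputs of 100 fps and 300 fps and weights 1 and 2:
# weighted_fps 1 100 2 300    # prints 60.00
```

With these illustrative numbers, the weighted terms are 1/100 and 2/300, so the slower graph contributes the larger share of the sum even though it has the smaller weight. In other words, raising its throughput improves the objective more.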

The architecture optimization process uses the compiler, the performance estimator, and the area estimator. Accordingly, its command line options are as follows:

- All required and optional arguments from Compiling a Graph, including the required --march option.
- All optional arguments from Estimating the Performance of a Graph.
- All optional arguments from Estimating the Area of an Architecture.
- The following arguments for architecture optimization:

Option	Description
--gen-arch	[Required] Enable the architecture optimizer.
--mmax-resources=`<max_ALMs>`,`<max_M20K_blocks>`,`<max_DSP_blocks>`	[Optional] Sets the maximum number of resources that the output architecture can use (as estimated by the area estimator). Specify the resources as a comma-delimited sequence of maximum ALMs, M20K blocks, and DSP blocks. If you do not specify this option, the architecture uses as many resources as needed, and the optimizer might use more resources than are available on the FPGA device.
--mmax-resources-alm-util=`<%_max_ALM_utilization>`	[Optional] Set this option only if you also set the --mmax-resources option. This option sets the percentage of the `<max_ALMs>` value from the --mmax-resources option that the architecture optimizer aims to use for the architecture. The remaining ALMs are left for the Intel® Quartus® Prime software to use to improve timing closure. The default value is `100` (100% utilization). Designers typically target full-chip logic utilization values lower than `100` to improve routing and timing closure of the design. For example, a target of 80% ALM utilization (`--mmax-resources-alm-util=80`) tells the architecture optimizer to use only 80% of the ALMs that were specified by the --mmax-resources option. The other 20% are left for the Intel® Quartus® Prime software.
--mtarget-fps	[Optional] Sets the minimum frames per second (fps) that the output architecture must achieve (as estimated by the performance estimator). This option significantly increases the run time and memory requirements of the architecture optimizer. The architecture optimizer can use more than 100 GB of RAM when using this option, and thus requires a server-class machine.
--interleave-search	[Optional] Causes the architecture optimizer to evaluate different legal feature and filter interleave options. This option increases the run time of the architecture optimizer significantly, by as much as 50%. Because of this run time penalty, after you find an optimal interleave for a given graph, place the optimized interleave into the initial input .arch file and avoid using --interleave-search during any future fine-tuning. For more information, refer to the "Parameter group: pe_array" topic of the Intel® FPGA AI Suite IP Reference Manual.
--gen-min-sb	[Optional] Minimum stream buffer size. Specify it explicitly to reduce the search space for the optimizer. For more information about stream buffer size, refer to the "Parameter: `stream_buffer_depth`" section in the "Parameter group: Global Parameters" topic of the Intel® FPGA AI Suite IP Reference Manual.
--network-weightings="`<network_weight_1>` `<network_weight_2>` `<network_weight_3>`...`<network_weight_n>`"	[Optional] Space-delimited specification of network weights when multiple networks are specified. If not specified, all networks are weighted equally.
--gen-arch-file	[Optional] Name of the output architecture (.arch) file.
--max-archsetM-percentage --arch-limit --time-limit	[Optional] Used when performing a larger search of the optimization space, as described in `dla_compiler --help`. Not recommended for general use.
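
As a rough check of the --mmax-resources-alm-util arithmetic described in the table, the following sketch computes the ALM count that the optimizer would aim for. It assumes that the target is simply the given percentage of `<max_ALMs>`, and that any rounding the tool applies is not modeled. The helper name alm_target is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: ALM target for a given budget and utilization percentage.
# Assumption: target = max_ALMs * percent / 100 (integer arithmetic).
# Usage: alm_target <max_ALMs> <percent>
alm_target() {
    echo $(( $1 * $2 / 100 ))
}

# Example: a 427200-ALM budget at 80% utilization.
# alm_target 427200 80    # prints 341760
```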

The architecture optimizer supports only 1xN and Nx1 interleave, as described in the release notes. Most architectures use a compatible interleave, but the A10_Generic.arch architecture file (included in the set of example architectures for the IP) uses a 2x2 interleave that must be modified to 4x1 (recommended) or 1x4.

For more information about modifying interleaving, refer to the "Parameters: pe_array/num_interleaved_features, pe_array/num_interleaved_filters" section in the "Parameter Group: pe_array" topic of the Intel® FPGA AI Suite IP Reference Manual.

The simplest command format to optimize an Architecture Description (.arch) file for a graph is as follows:

dla_compiler \
   --gen-arch \
   --mmax-resources=<max_ALMs>,<max_M20K_blocks>,<max_DSP_blocks> \
   --network-file <path or paths to graph.xml> \
   --march=<path to input .arch file>

Example Command

dla_compiler \
   --gen-arch \
   --mmax-resources=427200,2713,1518 \
   --gen-min-sb=2048 \
   --network-file ResNet50.xml Mobilenet_v1.xml \
   --march=./example_architecture/A10_Performance.arch \
   --mmax-resources-alm-util=75 \
   --fassumed-fmax-core=300 \
   --network-weightings="1 2"
