7.2. Estimating Graph Performance

FPGA AI Suite Handbook

ID 863373

Date 12/08/2025

Public

To estimate the performance of a graph on an architecture, use the --fanalyze-performance dla_compiler command option.

The dla_compiler command compiles the graph for the specified architecture to estimate its performance. The performance estimator assumes that any portions of the graph that are assigned to the CPU run inference with zero latency. That is, the performance estimate accounts only for the FPGA portions of the graph.
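To illustrate what the zero-latency assumption means in practice, the following sketch combines an FPGA-only throughput estimate with a separately measured CPU time. All numbers and variable names are hypothetical, and the sketch assumes that the CPU and FPGA portions run back to back with no overlap, which may not hold for a pipelined or batched runtime.

```shell
# Hypothetical numbers; none of these come from a real report.
fpga_fps=100          # throughput from the estimator (FPGA portion only), inferences/s
cpu_time_us=2500      # separately measured time of the CPU-assigned layers, per inference

# Per-inference time implied by the FPGA-only estimate, in microseconds.
fpga_time_us=$(( 1000000 / fpga_fps ))

# Assumption: CPU and FPGA portions run back to back with no overlap.
total_time_us=$(( fpga_time_us + cpu_time_us ))
end_to_end_fps=$(( 1000000 / total_time_us ))

echo "FPGA-only estimate: ${fpga_fps} inferences/s"
echo "with CPU portion:   ${end_to_end_fps} inferences/s"
```

Under these assumptions, the estimator's figure behaves as an upper bound for any graph that has CPU-assigned layers.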

The performance estimator also estimates the average memory bandwidth and the memory requirements of the graph. The estimated memory requirement is typically an underestimate because the estimate assumes one input buffer and one output buffer, while the FPGA AI Suite runtime uses a default of five of each.
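As a rough illustration of the size of this gap, the following sketch scales the input and output buffer portion from one buffer each to five each. The sizes and variable names are hypothetical, and splitting the estimate into "buffers" and "everything else" is an assumption made for the arithmetic; the performance report may not break the number down this way.

```shell
# Hypothetical sizes in bytes; substitute values for your own graph.
input_buffer_bytes=602112     # one input buffer (assumed size)
output_buffer_bytes=4000      # one output buffer (assumed size)
other_bytes=51200000          # everything else, such as filter data (assumed size)

runtime_buffers=5             # runtime default: five input and five output buffers

# What an estimate with one input buffer and one output buffer would count.
estimated=$(( other_bytes + input_buffer_bytes + output_buffer_bytes ))

# The same graph with the runtime's default buffer count.
adjusted=$(( other_bytes + runtime_buffers * (input_buffer_bytes + output_buffer_bytes) ))

echo "estimate with 1+1 buffers: ${estimated} bytes"
echo "estimate with ${runtime_buffers}+${runtime_buffers} buffers: ${adjusted} bytes"
echo "difference: $(( adjusted - estimated )) bytes"
```

The difference is four additional input buffers plus four additional output buffers, so it grows with the size of the graph's inputs and outputs.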

For DDR-free architectures, the performance estimate excludes the performance of the input/output streamer. The estimate applies only to the FPGA AI Suite IP itself.

The performance estimator accepts all the required and optional arguments described in Compiling a Graph. In addition, the following options are specific to estimating graph performance:

Option	Description
--fanalyze-performance	[Required] Enables the performance estimator.
--fassumed-fmax-core=<assumed f_MAX>	[Optional] Specifies the assumed f_MAX, in MHz, of the compiled FPGA AI Suite IP. The performance estimator cannot estimate the f_MAX of a given IP parameterization, and it does not know which speed grade the IP targets. Typically, the IP achieves 300 MHz or higher on a C2 Arria® 10 device. The default f_MAX depends on the FPGA device family: Arria® 10: 265 MHz; Agilex™ 3: 350 MHz; Agilex™ 5: 350 MHz; Agilex™ 7: 500 MHz; Cyclone® 10 GX: 265 MHz; Stratix® 10: 265 MHz.
--fassumed-memory-bandwidth	[Optional] Specifies the available average DDR bandwidth in MB/s for each instance of the FPGA AI Suite IP. Do not set this value if the IP does not use DDR. The default DDR bandwidth depends on the FPGA device family: Arria® 10: 19200 MB/s; Agilex™ 3: 6400 MB/s; Agilex™ 5: 6400 MB/s; Agilex™ 7: 21328 MB/s; Cyclone® 10 GX: 6400 MB/s; Stratix® 10: 19200 MB/s.
--fdump-performance-report	[Optional] Specifies an output file for the performance estimate. If this option is omitted, the performance summary is displayed on the terminal.

The simplest dla_compiler command format for estimating the performance of a graph is as follows:

dla_compiler \
   --network-file <path to graph.xml> \
   --march <path to .arch file> \
   --fanalyze-performance
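
Because the estimator cannot predict f_MAX, it can be useful to repeat the estimate at several assumed clock frequencies using the --fassumed-fmax-core option. The following sketch is a dry run that only prints the commands. The file names graph.xml and my.arch, the helper name build_cmd, and the chosen frequencies are placeholders, not values from the handbook.

```shell
# Placeholders: replace with your own graph and architecture files.
network_file="graph.xml"
arch_file="my.arch"

# Assumed f_MAX values in MHz to compare (hypothetical choices).
fmax_values="265 300 350"

# Print one dla_compiler performance-estimation command for the given f_MAX.
build_cmd() {
  printf 'dla_compiler --network-file %s --march %s --fanalyze-performance --fassumed-fmax-core=%s\n' \
    "$network_file" "$arch_file" "$1"
}

for fmax in $fmax_values; do
  build_cmd "$fmax"     # dry run: prints the command instead of executing it
done
```

After checking the printed commands, they can be run one at a time to compare the resulting performance summaries.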

Performance Summary

The performance summary consists of the following metrics:

Table 19. Performance Summary Metrics Provided by the FPGA AI Suite Compiler
Metric	Description
PE-only Conv Throughput No DDR	Throughput of the PE array (that is, only layers that are mapped into convolutions) assuming that there is no limit to DDR bandwidth.
PE-only Conv Throughput	Estimate that accounts for the performance impact of fetching filter data from external memory.
Overall throughput Inf PE Buf Depth (zero MPBW)	Models the latency impact of most writes to external memory and latency of the activation modules. The latency estimate is pessimistic.
Overall throughput Zero PE Buf Depth	Same as the previous row, but using an optimistic methodology to estimate the latency impact.
Overall Throughput Inf PE Buf Depth	Estimate that accounts for external memory bottlenecks that affect the feature prefetch. The latency estimate is pessimistic.
Overall Throughput Zero PE Buf Depth	Same as the previous row, but using an optimistic methodology to estimate the latency impact.

The final throughput number is further adjusted to more closely match on-board performance measurements. The majority of the adjustment is to account for the difference in practical versus theoretical DDR bandwidth.

Example Command

dla_compiler \
   --network-file ResNet50.xml ResNet101.xml \
   --march $COREDLA_ARCH/example_architectures/A10_Generic.arch \
   --fanalyze-performance \
   --fassumed-fmax-core=300 \
   --fassumed-memory-bandwidth=19200
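
To see which values in the example command differ from the defaults, the defaults from the option table can be written as a small lookup. The function names below are hypothetical, and treating A10_Generic.arch as an Arria® 10 architecture is an assumption based on its file name.

```shell
# Defaults copied from the option table, keyed by device family.
default_fmax_mhz() {
  case "$1" in
    "Arria 10"|"Cyclone 10 GX"|"Stratix 10") echo 265 ;;
    "Agilex 3"|"Agilex 5")                   echo 350 ;;
    "Agilex 7")                              echo 500 ;;
    *) echo "unknown family: $1" >&2; return 1 ;;
  esac
}

default_bandwidth_mb_per_s() {
  case "$1" in
    "Arria 10"|"Stratix 10")               echo 19200 ;;
    "Agilex 3"|"Agilex 5"|"Cyclone 10 GX") echo 6400 ;;
    "Agilex 7")                            echo 21328 ;;
    *) echo "unknown family: $1" >&2; return 1 ;;
  esac
}

family="Arria 10"   # assumed family for the A10_Generic.arch example
echo "f_MAX: example 300 MHz, default $(default_fmax_mhz "$family") MHz"
echo "bandwidth: example 19200 MB/s, default $(default_bandwidth_mb_per_s "$family") MB/s"
```

Under that assumption, the example command raises the assumed f_MAX above the default, while the bandwidth it passes equals the default.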
