7.2. Estimating Graph Performance
To estimate the performance of a graph on a given architecture, use the --fanalyze-performance option of the dla_compiler command.
The dla_compiler command compiles the graph for the specified architecture to estimate its performance. The performance estimator assumes that any portions of the graph that are assigned to the CPU run inference with zero latency. That is, the performance estimate accounts only for the FPGA portions of the graph.
The performance estimator also estimates the average memory bandwidth and the memory requirements of the graph. The estimated memory requirement is typically an underestimate because the memory estimates assume one input buffer and one output buffer, while the FPGA AI Suite runtime uses a default of five of each.
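As a rough illustration of why the reported memory requirement is an underestimate, the following sketch adds the four extra input buffers and four extra output buffers that the runtime allocates by default. All byte counts are hypothetical placeholders. The assumption that each extra buffer is the same size as the one counted by the estimator is not confirmed by this document.

```shell
#!/bin/sh
# Hypothetical values (assumptions): substitute figures from your own report.
ESTIMATE_BYTES=50000000   # memory requirement reported by the estimator
INPUT_BYTES=301056        # size of one input buffer (e.g. 224*224*3 elements, 2 bytes each)
OUTPUT_BYTES=2000         # size of one output buffer (e.g. 1000 elements, 2 bytes each)

ESTIMATOR_BUFFERS=1       # the estimator assumes one input and one output buffer
RUNTIME_BUFFERS=5         # the runtime default is five of each

# Extra memory for the buffers that the estimator does not count.
EXTRA_BYTES=$(( (RUNTIME_BUFFERS - ESTIMATOR_BUFFERS) * (INPUT_BYTES + OUTPUT_BYTES) ))
ADJUSTED_BYTES=$(( ESTIMATE_BYTES + EXTRA_BYTES ))

echo "extra buffer memory: $EXTRA_BYTES bytes"
echo "adjusted estimate:   $ADJUSTED_BYTES bytes"
```

Treat the adjusted figure as a sanity check only. Actual buffer sizes depend on how the runtime lays out and pads the data.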
For DDR-free architectures, the performance estimate excludes the performance of the input/output streamer. The estimate applies only to the FPGA AI Suite IP itself.
The dla_compiler command accepts all of the required and optional arguments described in Compiling a Graph. In addition, the following options are specific to estimating graph performance:
| Option | Description |
|---|---|
| --fanalyze-performance | [Required] Enables the performance estimator. |
| --fassumed-fmax-core=<assumed fMAX> | [Optional] Specifies the assumed fMAX of the compiled FPGA AI Suite IP. The performance estimator does not have the ability to estimate fMAX of a given IP parameterization, nor does it know which speed grade the IP targets. Typically, the IP achieves 300 MHz or higher on a C2 Arria® 10 device. The default fMAX depends on the FPGA device family. |
| --fassumed-memory-bandwidth | [Optional] Specifies the available average DDR bandwidth in MB/s for each instance of the FPGA AI Suite IP. Do not set this value if the IP does not use DDR. The default DDR bandwidth depends on the FPGA device family. |
| --fdump-performance-report | [Optional] Specifies an output file for the performance estimate. If this option is omitted, the performance summary is displayed on the terminal. |
The simplest dla_compiler command format for estimating the performance of a graph is as follows:
dla_compiler \
    --network-file <path to graph.xml> \
    --march <path to .arch file> \
    --fanalyze-performance
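The minimal command can be wrapped in a small script that prints the assembled command line for review and runs it only when dla_compiler is installed. This is a sketch, not part of the FPGA AI Suite. The paths graph.xml and my_design.arch and the file name perf_summary.txt are placeholders chosen for illustration.

```shell
#!/bin/sh
# Placeholder paths (assumptions): point these at a real graph and .arch file.
GRAPH="${GRAPH:-graph.xml}"
ARCH="${ARCH:-my_design.arch}"

# Assemble the minimal argument list for a performance estimate.
set -- --network-file "$GRAPH" --march "$ARCH" --fanalyze-performance
CMD="dla_compiler $*"

# Print the command so that it can be reviewed before it runs.
echo "$CMD"

# Run the estimate only if dla_compiler is on PATH, and keep a copy of the
# terminal summary in perf_summary.txt (a hypothetical file name).
if command -v dla_compiler >/dev/null 2>&1; then
  dla_compiler "$@" 2>&1 | tee perf_summary.txt
fi
```

Setting GRAPH and ARCH in the environment before running the script overrides the placeholder paths.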
Performance Summary
| Metric | Description |
|---|---|
| PE-only Conv Throughput No DDR | Throughput of the PE array (that is, only layers that are mapped into convolutions) assuming that there is no limit to DDR bandwidth. |
| PE-only Conv Throughput | Estimate that accounts for the performance impact of fetching filter data from external memory. |
| Overall Throughput Inf PE Buf Depth (zero MPBW) | Models the latency impact of most writes to external memory and latency of the activation modules. The latency estimate is pessimistic. |
| Overall Throughput Zero PE Buf Depth | Same as the previous row, but using an optimistic methodology to estimate the latency impact. |
| Overall Throughput Inf PE Buf Depth | Estimate that accounts for external memory bottlenecks that affect the feature prefetch. The latency estimate is pessimistic. |
| Overall Throughput Zero PE Buf Depth | Same as previous row, but using an optimistic methodology to estimate the latency impact. |
Example Command
dla_compiler \
    --network-file ResNet50.xml ResNet101.xml \
    --march $COREDLA_ARCH/example_architectures/A10_Generic.arch \
    --fanalyze-performance \
    --fassumed-fmax-core=300 \
    --fassumed-memory-bandwidth=19200
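This document does not say how the 19200 MB/s value in the example was chosen. It does match the theoretical peak bandwidth of a 64-bit DDR4-2400 interface, which the following sketch computes as transfer rate times bus width in bytes. The memory type and bus width are assumptions made for illustration. Because the option expects the available average bandwidth, a measured or derated figure may be more appropriate than the theoretical peak.

```shell
#!/bin/sh
# Illustrative memory parameters (assumptions, not taken from this document).
DDR_MTS=2400         # transfer rate in mega-transfers per second (DDR4-2400)
BUS_WIDTH_BITS=64    # width of the data bus in bits

# Theoretical peak bandwidth in MB/s = transfers per second * bytes per transfer.
PEAK_MBPS=$(( DDR_MTS * BUS_WIDTH_BITS / 8 ))

# Print the option in the form used by the example command.
echo "--fassumed-memory-bandwidth=$PEAK_MBPS"
```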