8.2.6.1. [OFS-PCIE] Inference on Image Classification Graphs

FPGA AI Suite: Design Examples User Guide


ID 848957

Date 4/30/2025

Version

Public

The demonstration application requires the OpenVINO™ device flag to be either HETERO:FPGA,CPU for heterogeneous execution or HETERO:FPGA for FPGA-only execution.

By default, the dla_benchmark demonstration application runs five inference requests (batches) in parallel on the FPGA to achieve optimal system performance. To measure steady-state performance, run multiple batches (using the -niter flag) because the first iteration is significantly slower on FPGA devices.

The dla_benchmark demonstration application also supports multiple graphs in the same execution. You can provide more than one graph or compiled graph as input, separated by commas.

Each graph can either have its own input dataset or share a common dataset with all other graphs. Each graph requires its own ground_truth_file, with the files separated by commas. If some ground_truth_file files are missing, dla_benchmark continues to run and ignores the missing ones.
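As a minimal sketch of the comma-separated form described above, the following shell snippet builds the graph and ground truth lists. The helper join_by_comma and all file names are hypothetical placeholders; the -m and -groundtruth_loc flags are taken from the usage example in this section, and using them for the multi-graph lists is an assumption based on the text.

```shell
# Hypothetical helper: joins its arguments with commas.
# The parenthesized body runs in a subshell, so the IFS change does not leak.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Placeholder paths for illustration only; substitute your own graphs and files.
models=$(join_by_comma "graph_a.xml" "graph_b.xml")
truths=$(join_by_comma "graph_a_truth.txt" "graph_b_truth.txt")

# Print the arguments that would be passed to dla_benchmark.
echo "-m $models -groundtruth_loc $truths"
```

Keeping both lists in the same order makes it easier to see which ground truth file belongs to which graph.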

When multi-graph is enabled, the -niter flag represents the number of iterations for each graph, so the total number of iterations becomes -niter × number of graphs.

The dla_benchmark demonstration application switches graphs after submitting -nireq requests. The request queue holds up to -nireq × number of graphs requests. This limit is constrained by the DMA CSR descriptor queue size (64 per hardware instance).
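The arithmetic in the two paragraphs above can be checked with a short shell sketch. The function names are hypothetical; the limit of 64 is the descriptor queue size quoted in the text, and the comparison assumes a single hardware instance.

```shell
# Descriptor queue size per hardware instance, as stated in the text.
DMA_DESCRIPTOR_QUEUE_SIZE=64

# total_iterations NITER NUM_GRAPHS: -niter x number of graphs
total_iterations() {
    echo $(( $1 * $2 ))
}

# queued_requests NIREQ NUM_GRAPHS: -nireq x number of graphs
queued_requests() {
    echo $(( $1 * $2 ))
}

# fits_descriptor_queue NIREQ NUM_GRAPHS: succeeds if the queue depth
# stays within the limit (assumes one hardware instance).
fits_descriptor_queue() {
    [ "$(queued_requests "$1" "$2")" -le "$DMA_DESCRIPTOR_QUEUE_SIZE" ]
}

# Example: two graphs with -niter=4 and -nireq=8.
echo "total iterations: $(total_iterations 4 2)"
echo "queued requests:  $(queued_requests 8 2)"
if fits_descriptor_queue 8 2; then
    echo "within descriptor queue limit"
else
    echo "exceeds descriptor queue limit"
fi
```

With these values the total is 8 iterations and 16 queued requests, which stays below the limit of 64.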

The board you use determines the number of instances that you can compile the FPGA AI Suite hardware for. For the Agilex™ 7 FPGA I-Series Development Kit and Intel® FPGA SmartNIC N6001-PL Platform, you can compile up to four instances, with the same architecture on all instances. Some large architectures, such as AGX7_Performance_Giant, might not fit on the board with four instances.

Each instance accesses one of the DDR banks on the board and executes the graph independently. This optimization enables multiple batches to run in parallel, limited by the number of DDR banks available. Each inference request created by the demonstration application is assigned to one of the instances in the FPGA plugin.

To enable memory-mapped device (MMD) debug messages when you run the dla_benchmark demonstration application, set the MMD_ENABLE_DEBUG environment variable as follows:

MMD_ENABLE_DEBUG=1

Also, you can test full DDR write and read back functionality when the dla_benchmark demonstration application runs by setting the COREDLA_RUNTIME_MEMORY_TEST environment variable as follows:

COREDLA_RUNTIME_MEMORY_TEST=1
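As a sketch of how these two variables might be applied in a POSIX shell, you can export them for the whole session or prefix them to a single command. The variable names come from the text above; the commented invocation is only a placeholder for your usual dla_benchmark command line.

```shell
# Export for every command started from this shell session.
export MMD_ENABLE_DEBUG=1
export COREDLA_RUNTIME_MEMORY_TEST=1

# Alternatively, set them for a single run only (placeholder arguments):
# MMD_ENABLE_DEBUG=1 COREDLA_RUNTIME_MEMORY_TEST=1 ./dla_benchmark/dla_benchmark <your usual arguments>

# Confirm what child processes will see.
echo "MMD_ENABLE_DEBUG=$MMD_ENABLE_DEBUG"
echo "COREDLA_RUNTIME_MEMORY_TEST=$COREDLA_RUNTIME_MEMORY_TEST"
```

Run `unset MMD_ENABLE_DEBUG COREDLA_RUNTIME_MEMORY_TEST` to remove the exported variables before taking performance measurements, if you do not want the extra debug output or memory test in those runs.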

To ensure that batches are evenly distributed between the instances, you must choose an inference request batch size that is a multiple of the number of FPGA AI Suite instances. For example, with two instances, specify the batch size as six (instead of the OpenVINO™ default of five) to ensure that the experiment meets this requirement.
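The rounding in this example can be sketched as a small shell function. The name round_up_to_multiple is hypothetical, and the text does not say which flag carries the resulting value (the usage example in this section sets -nireq).

```shell
# round_up_to_multiple VALUE MULTIPLE
# Prints the smallest multiple of MULTIPLE that is greater than or equal to VALUE.
round_up_to_multiple() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# Default of five requests with two instances.
round_up_to_multiple 5 2   # prints 6

# Default of five requests with four instances.
round_up_to_multiple 5 4   # prints 8
```

A value that is already a multiple of the instance count is returned unchanged.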

The example usage that follows assumes the following:

A Model Optimizer IR .xml file is in demo/models/public/resnet-50-tf/FP32/
An image set is in demo/sample_images/
The board is programmed with a bitstream that corresponds to AGX7_Performance.arch

binxml=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32

imgdir=$COREDLA_ROOT/demo/sample_images

cd $COREDLA_ROOT/runtime/build_Release

./dla_benchmark/dla_benchmark \
   -b=1 \
   -m $binxml/resnet-50-tf.xml \
   -d=HETERO:FPGA,CPU \
   -i $imgdir \
   -niter=4 \
   -plugins ./plugins.xml \
   -arch_file $COREDLA_ROOT/example_architectures/AGX7_Performance.arch \
   -api=async \
   -groundtruth_loc $imgdir/TF_ground_truth.txt \
   -perf_est \
   -nireq=8 \
   -bgr
