8.2.6.1. [OFS-PCIE] Inference on Image Classification Graphs
The demonstration application requires the OpenVINO™ device flag to be either HETERO:FPGA,CPU for heterogeneous execution or HETERO:FPGA for FPGA-only execution.
By default, the dla_benchmark demonstration application runs five inference requests (batches) in parallel on the FPGA to achieve optimal system performance. To measure steady-state performance, run multiple batches (using the -niter flag), because the first iteration is significantly slower on FPGA devices.
The dla_benchmark demonstration application also supports multiple graphs in the same execution. You can specify more than one graph or compiled graph as input, separated by commas.
Each graph can either use its own input dataset or share a common dataset with all other graphs. Each graph requires its own ground_truth_file, and the files are specified as a comma-separated list. If some ground_truth_file entries are missing, dla_benchmark continues to run and ignores the missing ones.
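As a minimal sketch of the comma-separated form described above, the following shell fragment joins a set of file names into a single comma-separated string. The helper name join_by_comma and all file names are hypothetical and exist only for illustration; the fragment does not invoke dla_benchmark.

```shell
# Join all arguments into one comma-separated string.
# The function body runs in a subshell, so changing IFS does not leak out.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Hypothetical graph and ground-truth file names (illustration only).
graphs=$(join_by_comma resnet-50-tf.xml mobilenet-v2.xml)
truths=$(join_by_comma resnet_truth.txt mobilenet_truth.txt)

printf 'graph list:        %s\n' "$graphs"
printf 'ground-truth list: %s\n' "$truths"
```

The resulting strings could then be supplied wherever the application expects a graph list or a ground-truth list.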
When multi-graph is enabled, the -niter flag represents the number of iterations for each graph, so the total number of iterations becomes -niter × number of graphs.
The dla_benchmark demonstration application switches graphs after submitting -nireq requests. The request queue can hold up to -nireq × number of graphs requests. This limit is constrained by the DMA CSR descriptor queue size (64 per hardware instance).
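The arithmetic in the two paragraphs above can be sketched in shell as follows. The variable names and the graph count of three are hypothetical. Comparing the queue depth against the descriptor limit of a single hardware instance is an assumption made here as a conservative check; the text does not state how the limit scales across instances.

```shell
niter=4          # iterations per graph (-niter)
nireq=8          # inference requests submitted per graph (-nireq)
num_graphs=3     # hypothetical number of graphs
queue_limit=64   # DMA CSR descriptor queue size per hardware instance

# Total iterations and request queue depth in multi-graph mode.
total_iters=$(( niter * num_graphs ))
queue_depth=$(( nireq * num_graphs ))

printf 'total iterations:    %s\n' "$total_iters"
printf 'request queue depth: %s\n' "$queue_depth"

# Assumption: check against the limit of one hardware instance.
if [ "$queue_depth" -le "$queue_limit" ]; then
    queue_ok=yes
else
    queue_ok=no
fi
printf 'within descriptor queue limit: %s\n' "$queue_ok"
```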
The board you use determines the number of instances for which you can compile the FPGA AI Suite hardware. For the Agilex™ 7 FPGA I-Series Development Kit and the Intel® FPGA SmartNIC N6001-PL Platform, you can compile up to four instances, with the same architecture on all instances. Some large architectures, such as AGX7_Performance_Giant, might not fit on the board with four instances.
Each instance accesses one of the DDR banks on the board and executes the graph independently. This optimization enables multiple batches to run in parallel, limited by the number of DDR banks available. Each inference request created by the demonstration application is assigned to one of the instances in the FPGA plugin.
MMD_ENABLE_DEBUG=1
COREDLA_RUNTIME_MEMORY_TEST=1
To ensure that batches are evenly distributed between the instances, you must choose a number of inference requests (batches) that is a multiple of the number of FPGA AI Suite instances. For example, with two instances, specify six inference requests (instead of the OpenVINO™ default of five) to meet this requirement.
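One way to pick a suitable value is to round the desired request count up to the next multiple of the instance count. The following sketch uses hypothetical variable names and assumes that the count is set through the -nireq flag used elsewhere in this section.

```shell
num_instances=2   # number of FPGA AI Suite instances in the bitstream
requested=5       # desired request count (the default mentioned above)

# Round up to the next multiple of num_instances with integer arithmetic.
nireq=$(( (requested + num_instances - 1) / num_instances * num_instances ))

# Assumption: the value is passed to the application as -nireq.
printf '%s\n' "-nireq=$nireq"
```

With two instances and a requested count of five, the computed value is six, which matches the example in the text.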
The following example command assumes that:
- A Model Optimizer IR .xml file is in demo/models/public/resnet-50-tf/FP32/
- An image set is in demo/sample_images/
- The board is programmed with a bitstream that corresponds to AGX7_Performance.arch
binxml=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
imgdir=$COREDLA_ROOT/demo/sample_images
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
    -b=1 \
    -m $binxml/resnet-50-tf.xml \
    -d=HETERO:FPGA,CPU \
    -i $imgdir \
    -niter=4 \
    -plugins ./plugins.xml \
    -arch_file $COREDLA_ROOT/example_architectures/AGX7_Performance.arch \
    -api=async \
    -groundtruth_loc $imgdir/TF_ground_truth.txt \
    -perf_est \
    -nireq=8 \
    -bgr