FPGA AI Suite Handbook

ID 863373
Date 11/21/2025
Public

2.4.3. Model Performance

The performance estimator tool assumes the following fMAX values for FPGA devices. For information about the performance estimator tool, refer to Estimating Graph Performance.
  • Agilex™ 3: 350 MHz
  • Agilex™ 5: 350 MHz
  • Agilex™ 7: 400 MHz
  • Arria® 10: 265 MHz
These assumptions are reasonable and conservative for the standard speed bin. As shown by the results in this section, the achieved fMAX of the design example typically exceeds these assumptions.

The performance results for the designs that follow were achieved using the dla_build_example_design.py script that is included with the FPGA AI Suite. The script uses a standard (-2) speed bin with a single seed and uses high-effort compiler settings.

The runtime hosts used for determining the performance results are as follows:
  • Agilex™ 7 runtime host: SUSE Linux Enterprise Server 15 host on an Intel® Xeon® processor E5-1650 @ 3.5 GHz.
This design uses a dedicated DDR interface for the IP. The batch size is 1. Performance varies with the clock speed and with the DDR latency and bandwidth.
The dla_build_example_design.py script includes the following two .qsf lines to enable non-default Quartus® Prime options during design compilation:
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
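
For a project that is compiled outside dla_build_example_design.py, one way to confirm that a .qsf file carries the same two options is a small text check such as the following Python sketch. It only understands simple "set_global_assignment -name NAME VALUE" lines and is not a general .qsf parser. The function names are hypothetical and are not part of the FPGA AI Suite or Quartus Prime.

```python
# The two non-default assignments quoted above, as name -> expected value.
REQUIRED_ASSIGNMENTS = {
    "ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES": "ALWAYS",
    "DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES": "OFF",
}


def parse_global_assignments(qsf_text):
    """Return {name: value} for simple 'set_global_assignment -name NAME VALUE' lines."""
    found = {}
    for raw_line in qsf_text.splitlines():
        # Drop anything after '#', which starts a comment in this simplified view.
        line = raw_line.split("#", 1)[0].strip()
        tokens = line.split()
        if (
            len(tokens) >= 4
            and tokens[0] == "set_global_assignment"
            and tokens[1] == "-name"
        ):
            found[tokens[2]] = " ".join(tokens[3:])
    return found


def missing_assignments(qsf_text):
    """List the required assignment names that are absent or set to another value."""
    found = parse_global_assignments(qsf_text)
    return [
        name
        for name, value in REQUIRED_ASSIGNMENTS.items()
        if found.get(name) != value
    ]
```

An empty list from missing_assignments means that both options are present with the values shown above.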

The architectures in the tables that follow are in the $COREDLA_ROOT/example_architectures/ directory. Review the README file in that directory for information about each architecture.

The IP Throughput column in the tables that follow shows the performance for the portion of the graph that runs on the FPGA device. In many cases, the entire graph runs on the FPGA device. The IP Throughput is representative of performance if the IP is used in a hostless configuration.

The IP+host Throughput column in the tables that follow shows the performance including the host. The IP+host performance may be lower than IP-only performance if the host is unable to stream data to the FPGA device quickly enough, or if the host is limited by some of the processing associated with the graph (for example, the host performs NMS for the YOLOv3 graph). Achievable IP+host performance depends on the speed and loading of the host and the FPGA AI Suite IP.
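
The gap between the two throughput columns can be expressed as a simple ratio. The following Python sketch is illustrative only: the helper name host_efficiency is hypothetical, and the sample values are copied from the public/mobilenet-v1-1.0-224 table later in this section.

```python
def host_efficiency(ip_fps, ip_host_fps):
    """Fraction of the IP-only throughput that remains once the host is included."""
    if ip_fps <= 0:
        raise ValueError("ip_fps must be positive")
    return ip_host_fps / ip_fps


# (IP Throughput, IP+host Throughput) in fps, from the mobilenet-v1-1.0-224 table.
SAMPLE_ROWS = {
    "AGX7_FP16_Generic": (171, 171),
    "AGX7_Performance": (525, 294),
    "AGX7_Performance_Giant": (1398, 444),
}

if __name__ == "__main__":
    for architecture, (ip_fps, ip_host_fps) in SAMPLE_ROWS.items():
        ratio = host_efficiency(ip_fps, ip_host_fps)
        print(f"{architecture}: {ratio:.2f}")
```

A ratio near 1.0 suggests that the host keeps up with the IP for that graph. A lower ratio suggests that the host side (data streaming or host-side processing) limits the result on the measured system.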

Details - FPGA AI Suite 2025.3

Architecture fMAX ALMs DSPs M20Ks Registers Board
AGX7_FP16_Generic 600 MHz 33.0 k 186 516 90 k DE10¹
AGX7_FP16_Performance 600 MHz 101.5 k 1162 1530 299 k DE10¹
AGX7_Small_NoSoftmax 610 MHz 17.0 k 80 300 53 k DE10¹
AGX7_Small_Softmax 608 MHz 18.6 k 90 308 57 k DE10¹
AGX7_Generic 591 MHz 38.4 k 202 782 109 k DE10¹
AGX7_Performance 533 MHz 66.2 k 650 1273 172 k DE10¹
AGX7_Performance_Giant 480 MHz 128.4 k 1546 2360 360 k DE10¹
AGX7_Streaming_Ddrfree_Resnet18 541 MHz 74.4 k 296 8179 177 k AG7I_DK²
AGX5_Performance 308 MHz 79.8 k 266 1151 195 k AG5_MOD³
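
As a rough cross-check, the following Python sketch compares the achieved fMAX values in the Details table with the fMAX values that the performance estimator assumes. The device family is inferred from the architecture name prefix (AGX7_ or AGX5_), which is an assumption made for this illustration. The function and dictionary names are hypothetical and are not part of the FPGA AI Suite.

```python
# fMAX values assumed by the performance estimator (MHz), from the list
# at the start of this section.
ASSUMED_FMAX_MHZ = {"Agilex 7": 400, "Agilex 5": 350}

# Achieved fMAX values (MHz), copied from the Details table.
ACHIEVED_FMAX_MHZ = {
    "AGX7_FP16_Generic": 600,
    "AGX7_FP16_Performance": 600,
    "AGX7_Small_NoSoftmax": 610,
    "AGX7_Small_Softmax": 608,
    "AGX7_Generic": 591,
    "AGX7_Performance": 533,
    "AGX7_Performance_Giant": 480,
    "AGX7_Streaming_Ddrfree_Resnet18": 541,
    "AGX5_Performance": 308,
}


def family_of(architecture):
    """Infer the device family from the architecture name prefix (an assumption)."""
    if architecture.startswith("AGX7_"):
        return "Agilex 7"
    if architecture.startswith("AGX5_"):
        return "Agilex 5"
    raise ValueError(f"unknown architecture prefix: {architecture}")


def fmax_margin_percent(architecture):
    """Achieved fMAX relative to the estimator assumption, in percent."""
    assumed = ASSUMED_FMAX_MHZ[family_of(architecture)]
    achieved = ACHIEVED_FMAX_MHZ[architecture]
    return 100.0 * (achieved - assumed) / assumed


if __name__ == "__main__":
    for name in ACHIEVED_FMAX_MHZ:
        print(f"{name}: {fmax_margin_percent(name):+.1f}%")
```

With the values above, AGX7_FP16_Generic comes out at +50.0% and AGX5_Performance at -12.0% relative to the assumed fMAX.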

public/mobilenet-v1-1.0-224

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 2268 171 171 71.2 89.5
AGX7_FP16_Performance 101.5 k 1162 9223 579 520 71.2 89.5
AGX7_Small_NoSoftmax 17.0 k 80 2804 169 169 70.9 89.6
AGX7_Small_Softmax 18.6 k 90 2797 169 168 70.9 89.5
AGX7_Generic 38.4 k 202 3259 251 251 70.9 89.5
AGX7_Performance 66.2 k 650 8245 525 294 70.9 89.5
AGX7_Performance_Giant 128.4 k 1546 14476 1398 444 71.0 89.6

public/mobilenet-v2

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 3682 149 148 71.8 89.6
AGX7_FP16_Performance 101.5 k 1162 7112 381 371 71.8 89.6
AGX7_Small_NoSoftmax 17.0 k 80 4684 144 139 71.6 89.6
AGX7_Small_Softmax 18.6 k 90 4673 143 139 71.8 89.4
AGX7_Generic 38.4 k 202 3522 263 250 65.6 86.8
AGX7_Performance 66.2 k 650 6831 327 269 71.7 89.4
AGX7_Performance_Giant 128.4 k 1546 10368 1046 458 71.8 89.4

public/mobilenet-v2-1.4-224

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 4133 123 123 74.8 91.9
AGX7_FP16_Performance 101.5 k 1162 9000 299 293 74.8 91.9
AGX7_Generic 38.4 k 202 4181 151 146 74.7 91.8
AGX7_Performance 66.2 k 650 8341 277 237 74.7 91.8
AGX7_Performance_Giant 128.4 k 1546 13095 829 477 74.7 91.7

public/mobilenet-v3-large-1.0-224-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 3846 172 169 75.8 92.1
AGX7_FP16_Performance 101.5 k 1162 11461 245 238 75.8 92.1
AGX7_Generic 38.4 k 202 4513 180 172 72.3 90.7
AGX7_Performance 66.2 k 650 10838 236 224 72.1 90.5
AGX7_Performance_Giant 128.4 k 1546 14525 353 284 72.6 90.6

public/resnet-50-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 3009 32 32 76.8 92.9
AGX7_FP16_Performance 101.5 k 1162 11701 166 164 76.8 92.9
AGX7_Small_NoSoftmax 17.0 k 80 5961 28 28 77.0 92.9
AGX7_Small_Softmax 18.6 k 90 5947 28 28 77.1 92.9
AGX7_Generic 38.4 k 202 4154 60 60 77.1 92.9
AGX7_Performance 66.2 k 650 11029 156 144 76.9 92.9
AGX7_Performance_Giant 128.4 k 1546 13313 224 218 76.9 92.8

Resnet50 v1 (Caffe)

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 2825 38 38 74.4 91.4
AGX7_FP16_Performance 101.5 k 1162 12025 193 195 74.4 91.4
AGX7_Small_NoSoftmax 17.0 k 80 4178 37 37 74.1 91.4
AGX7_Small_Softmax 18.6 k 90 4167 37 37 74.2 91.3
AGX7_Generic 38.4 k 202 4432 72 72 74.2 91.3
AGX7_Performance 66.2 k 650 11531 185 181 74.0 91.4
AGX7_Performance_Giant 128.4 k 1546 14363 257 233 74.1 91.4

intel/unet-camvid-onnx-0001

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps]
AGX7_FP16_Generic 33.0 k 186 826 1.09
AGX7_FP16_Performance 101.5 k 1162 4543 7.55
AGX7_Small_NoSoftmax 17.0 k 80 1141 1.10
AGX7_Small_Softmax 18.6 k 90 1138 1.10
AGX7_Generic 38.4 k 202 1305 2.12
AGX7_Performance 66.2 k 650 4016 6.83
AGX7_Performance_Giant 128.4 k 1546 5430 10.69

public/yolo-v3-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 33.0 k 186 1431 4.3 4.1 62.27 31.58
AGX7_FP16_Performance 101.5 k 1162 6340 27.9 27.9 62.25 31.58
AGX7_Generic 38.4 k 202 1878 8.1 7.6 62.28 31.49
AGX7_Performance 66.2 k 650 5806 25.4 10.5 62.22 31.47
AGX7_Performance_Giant 128.4 k 1546 8585 38.4 26.9 62.25 31.46

public/yolo-v3-tiny-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 33.0 k 186 1185 41.6 35.4 35.79 14.77
AGX7_FP16_Performance 101.5 k 1162 4552 116.6 115.3 35.81 14.78
AGX7_Generic 38.4 k 202 2314 81.7 64.1 35.76 14.74
AGX7_Performance 66.2 k 650 4187 107.2 34.7 35.73 14.72
AGX7_Performance_Giant 128.4 k 1546 5451 100.8 54.2 35.81 14.75

public/yolo-v8-nano detection

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Performance 101.5 k 1162 6780 93 91 51.13 36.51
AGX7_Generic 38.4 k 202 2412 49 39 51.05 36.46
AGX7_Performance 66.2 k 650 6416 92 31 51.05 36.45

public/yolo-v8-nano classification

Architecture ALMs DSPs DDR4 [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Performance 101.5 k 1162 9718 1301 67.88 87.76
AGX7_Generic 38.4 k 202 5429 932 67.86 87.74
AGX7_Performance 66.2 k 650 9715 1296 67.76 87.82

public/squeezenet1.1

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 634 219 220 58.5 81.1
AGX7_FP16_Performance 101.5 k 1162 4598 924 774 58.5 81.1
AGX7_Small_NoSoftmax 17.0 k 80 929 221 221 58.5 81.0
AGX7_Small_Softmax 18.6 k 90 926 220 220 58.5 81.0
AGX7_Generic 38.4 k 202 1704 529 529 58.5 81.0
AGX7_Performance 66.2 k 650 4253 852 265 58.4 81.0
AGX7_Performance_Giant 128.4 k 1546 5213 864 451 58.3 81.1

public/i3d_rgb_tf

Architecture ALMs DSPs DDR4 [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.0 k 186 442 0.61 65.79 82.89
AGX7_FP16_Performance 101.5 k 1162 2545 4.11 66.22 82.89
AGX7_Small_NoSoftmax 17.0 k 80 491 0.58 65.35 82.89
AGX7_Small_Softmax 18.6 k 90 490 0.58 65.57 83.11
AGX7_Generic 38.4 k 202 733 1.34 65.57 83.11
AGX7_Performance 66.2 k 650 2286 3.69 65.13 83.11
AGX7_Performance_Giant 128.4 k 1546 2713 4.22 65.79 82.89

ResNet50 V1

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps]
AGX5_Performance 79.8 k 266 4657 74.78

ResNet18 (PyTorch)

The table that follows presents performance measurements for a DDR-free streaming architecture built for the PyTorch ResNet-18 model. The analysis distinguishes two metrics, core IP throughput and IP throughput:
  • Core IP throughput

    This metric isolates the performance of the core FPGA AI Suite IP, excluding the input and output streamer. It highlights the computational efficiency of the processing elements within the architecture.

  • IP throughput

    This metric captures the overall inference performance of the overlay IP, including the input and output streaming components.

Architecture ALMs DSPs M20Ks IP Throughput [fps]
AGX7_Streaming_Ddrfree_Resnet18 74.4 k 296 8179 173
1 Terasic* DE10-Agilex Development Board (DE10-Agilex-B2E2)
2 Agilex™ 7 FPGA I-Series Development Kit ES2 (DK-DEV-AGI027RBES)
3 Agilex™ 5 FPGA E-Series 065B Modular Development Kit (MK-A5E065BB32AES1)
* DDR is estimated minimum average read + write (that is, read + write require at least this much bandwidth on average). Peak bandwidth is higher.
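
The DDR bandwidth and IP throughput columns can be combined into a rough estimate of memory traffic per inference. The following Python sketch assumes that each DDR figure corresponds to the IP Throughput listed in the same row, which this section does not state explicitly. The helper name mb_per_frame is hypothetical, and the sample values are copied from the public/resnet-50-tf table.

```python
def mb_per_frame(ddr_mb_per_s, ip_fps):
    """Approximate DDR traffic per inference in MB (average read + write)."""
    if ip_fps <= 0:
        raise ValueError("ip_fps must be positive")
    return ddr_mb_per_s / ip_fps


# (DDR bandwidth in MB/s, IP Throughput in fps), from the resnet-50-tf table.
RESNET50_TF_ROWS = {
    "AGX7_FP16_Generic": (3009, 32),
    "AGX7_Generic": (4154, 60),
    "AGX7_Performance": (11029, 156),
}

if __name__ == "__main__":
    for architecture, (ddr, fps) in RESNET50_TF_ROWS.items():
        print(f"{architecture}: {mb_per_frame(ddr, fps):.1f} MB per frame")
```

Because the DDR column is an estimated minimum average, the result is best read as a lower bound on average traffic per frame and not as a measurement of peak demand.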