FPGA AI Suite: IP Reference Manual

ID 768974
Date 4/21/2025
Public

2.2. Model Performance

The performance estimator tool (described in the FPGA AI Suite Compiler Reference Manual) assumes the following fMAX values for FPGA devices:
  • Agilex™ 5: 350 MHz
  • Agilex™ 7: 400 MHz
  • Arria® 10: 265 MHz
These assumptions are reasonable and conservative for the standard speed bin. As shown by the results in this section, the achieved fMAX of the example design typically exceeds these assumptions.
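As a rough illustration of how the assumed and achieved fMAX values compare, the following minimal sketch computes the relative headroom for a few rows of the Details table in this section. The function name `fmax_headroom` and the device labels are hypothetical and chosen only for this example; they are not part of the FPGA AI Suite.

```python
# Assumed fMAX values (MHz) quoted above for the performance estimator.
ASSUMED_FMAX_MHZ = {"Agilex 5": 350, "Agilex 7": 400, "Arria 10": 265}


def fmax_headroom(device: str, achieved_mhz: float) -> float:
    """Return achieved fMAX relative to the assumed fMAX (0.10 means 10% above)."""
    return achieved_mhz / ASSUMED_FMAX_MHZ[device] - 1.0


# Achieved fMAX values copied from the Details table in this section.
examples = [
    ("AGX7_FP16_Generic", "Agilex 7", 615),
    ("AGX7_Performance_Giant", "Agilex 7", 531),
    ("AGX5_Performance", "Agilex 5", 311),
]

for arch, device, achieved in examples:
    print(f"{arch}: {fmax_headroom(device, achieved):+.1%} relative to the assumed fMAX")
```

In this sample, the Agilex 7 architectures exceed the 400 MHz assumption, while the single Agilex 5 result (311 MHz) falls below the 350 MHz assumption, which is consistent with the word "typically" in the statement above.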

The performance results for the designs that follow were achieved using the dla_build_example_design.py script that is included with the FPGA AI Suite. The script uses a standard (-2) speed bin with a single seed and uses high-effort compiler settings.

The runtime hosts used for determining the performance results are as follows:
  • Agilex™ 7 runtime host: SUSE Linux Enterprise Server 15 host on an Intel® Xeon® processor E5-1650 @ 3.5 GHz.
This design uses a dedicated DDR interface for the IP. The batch size is 1. Performance varies with the clock speed and with the DDR latency and bandwidth.
The dla_build_example_design.py script includes the following two .qsf lines to enable non-default Quartus® Prime options during design compilation:
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
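If you build a design outside of the script and want to confirm that your own .qsf file carries the same two assignments, a check along the following lines might help. This is a minimal sketch: the helper name `missing_assignments` is hypothetical, and the parsing is deliberately simplified (one assignment per line, `#` comments, no Tcl line continuations or additional options).

```python
# The two assignments quoted in the text above, compared as plain strings.
REQUIRED_ASSIGNMENTS = {
    "ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES": "ALWAYS",
    "DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES": "OFF",
}


def missing_assignments(qsf_text: str) -> dict:
    """Return required assignments that are absent or set to a different value."""
    found = {}
    for line in qsf_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        tokens = line.split("#", 1)[0].split()
        # Expected shape: set_global_assignment -name <NAME> <VALUE>
        if len(tokens) >= 4 and tokens[:2] == ["set_global_assignment", "-name"]:
            found[tokens[2]] = tokens[3].strip('"')
    return {
        name: value
        for name, value in REQUIRED_ASSIGNMENTS.items()
        if found.get(name) != value
    }


# Example: a .qsf fragment that contains only the first assignment.
sample = """\
set_global_assignment -name ALLOW_SHIFT_REGISTER_MERGING_ACROSS_HIERARCHIES ALWAYS
"""
print(missing_assignments(sample))
```

An empty dictionary means that both assignments are present with the values shown above.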

The architectures in the tables that follow are in the $COREDLA_ROOT/example_architectures/ directory. Review the README file in that directory for information about each architecture.

The IP Throughput column in the tables that follow shows the performance for the portion of the graph that runs on the FPGA device. In many cases, the entire graph runs on the FPGA device. The IP Throughput is representative of performance if the IP is used in a hostless configuration.

The IP+host Throughput column in the tables that follow shows the performance including the host. The IP+host performance may be lower than IP-only performance if the host is unable to stream data to the FPGA device quickly enough, or if the host is limited by some of the processing associated with the graph (for example, the host performs NMS for the YOLOv3 graph). Achievable IP+host performance depends on the speed and loading of the host and the FPGA AI Suite IP.
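One way to read the two throughput columns together is to compute the ratio of IP+host throughput to IP throughput. The following sketch does this for three rows of the public/mobilenet-v1-1.0-224 table. The function name `host_efficiency` and the 0.9 threshold are assumptions made for this example only; the threshold is an arbitrary cut-off, not a value defined by the FPGA AI Suite.

```python
def host_efficiency(ip_fps: float, ip_host_fps: float) -> float:
    """Return the fraction of IP-only throughput that is retained with the host."""
    return ip_host_fps / ip_fps


# (architecture, IP Throughput [fps], IP+host Throughput [fps])
# Values copied from the public/mobilenet-v1-1.0-224 table in this section.
rows = [
    ("AGX7_FP16_Performance", 572, 564),
    ("AGX7_Performance", 532, 387),
    ("AGX7_Performance_Giant", 1476, 719),
]

HOST_LIMITED_BELOW = 0.9  # hypothetical threshold, chosen for illustration

for arch, ip_fps, ip_host_fps in rows:
    ratio = host_efficiency(ip_fps, ip_host_fps)
    note = "possibly host-limited" if ratio < HOST_LIMITED_BELOW else "close to IP-only"
    print(f"{arch}: {ratio:.2f} ({note})")
```

A ratio well below 1 suggests that the host, rather than the IP, limits the measured throughput for that configuration.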

Details - FPGA AI Suite 2025.1

Architecture fMAX ALMs DSPs M20Ks Registers Board
AGX7_FP16_Generic 615 MHz 33.6 k 186 516 105 k DE10 [1]
AGX7_FP16_Performance 600 MHz 103.6 k 1162 1543 327 k DE10 [1]
AGX7_Small_NoSoftmax 606 MHz 17.0 k 80 300 49 k DE10 [1]
AGX7_Small_Softmax 600 MHz 18.4 k 90 308 59 k DE10 [1]
AGX7_Generic 605 MHz 38.7 k 202 782 121 k DE10 [1]
AGX7_Performance 541 MHz 70.4 k 650 1286 207 k DE10 [1]
AGX7_Performance_Giant 531 MHz 125.5 k 1546 2386 363 k DE10 [1]
AGX_Streaming_Ddrfree_Resnet18 536 MHz 77.7 k 296 8048 195052 AG7I_DK [2]
AGX5_Performance 311 MHz 83.6 k 266 1151 251 k AG5_MOD [3]

public/mobilenet-v1-1.0-224

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 2323 176 175 71.2 89.5
AGX7_FP16_Performance 103.6 k 1162 9121 572 564 71.2 89.5
AGX7_Small_NoSoftmax 17.0 k 80 2788 168 168 70.9 89.6
AGX7_Small_Softmax 18.4 k 90 2761 167 166 70.9 89.5
AGX7_Generic 38.7 k 202 3331 257 251 70.9 89.5
AGX7_Performance 70.4 k 650 8355 532 387 70.9 89.5
AGX7_Performance_Giant 125.5 k 1546 15278 1476 719 71.0 89.6

public/mobilenet-v2

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 3759 152 152 71.8 89.6
AGX7_FP16_Performance 103.6 k 1162 7012 375 374 71.8 89.6
AGX7_Small_NoSoftmax 17.0 k 80 4660 143 139 71.6 89.6
AGX7_Small_Softmax 18.4 k 90 4624 142 139 71.8 89.4
AGX7_Generic 38.7 k 202 2748 205 198 71.8 89.4
AGX7_Performance 70.4 k 650 6877 329 288 71.7 89.4
AGX7_Performance_Giant 125.5 k 1546 10878 1098 670 71.8 89.4

public/mobilenet-v2-1.4-224

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 4220 126 125 74.8 91.9
AGX7_FP16_Performance 103.6 k 1162 8960 298 291 74.8 91.9
AGX7_Generic 38.7 k 202 4253 153 150 74.7 91.8
AGX7_Performance 70.4 k 650 8425 280 236 74.7 91.8
AGX7_Performance_Giant 125.5 k 1546 13593 860 633 74.7 91.7

public/mobilenet-v3-large-1.0-224-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 3924 175 173 75.8 92.1
AGX7_FP16_Performance 103.6 k 1162 11421 244 238 75.8 92.1
AGX7_Generic 38.7 k 202 4589 183 179 72.3 90.7
AGX7_Performance 70.4 k 650 10939 238 232 72.1 90.5
AGX7_Performance_Giant 125.5 k 1546 14897 362 322 72.6 90.6

public/resnet-50-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 3080 32 32 76.8 92.9
AGX7_FP16_Performance 103.6 k 1162 11567 164 162 76.8 92.9
AGX7_Small_NoSoftmax 17.0 k 80 5929 28 28 77.0 92.9
AGX7_Small_Softmax 18.4 k 90 5877 28 28 77.1 92.9
AGX7_Generic 38.7 k 202 4237 61 61 77.1 92.9
AGX7_Performance 70.4 k 650 11058 156 151 76.9 92.9
AGX7_Performance_Giant 125.5 k 1546 14023 236 230 76.9 92.8

Resnet50 v1 (Caffe)

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 2892 39 39 74.4 91.4
AGX7_FP16_Performance 103.6 k 1162 12271 197 192 74.4 91.4
AGX7_Small_NoSoftmax 17.0 k 80 4153 37 37 74.1 91.4
AGX7_Small_Softmax 18.4 k 90 4112 36 36 74.2 91.3
AGX7_Generic 38.7 k 202 4522 74 74 74.2 91.3
AGX7_Performance 70.4 k 650 11615 187 181 74.0 91.4
AGX7_Performance_Giant 125.5 k 1546 15095 271 251 74.1 91.4

intel/unet-camvid-onnx-0001

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps]
AGX7_FP16_Generic 33.6 k 186 846 1.11
AGX7_FP16_Performance 103.6 k 1162 4546 7.56
AGX7_Small_NoSoftmax 17.0 k 80 1134 1.10
AGX7_Small_Softmax 18.4 k 90 1123 1.09
AGX7_Generic 38.7 k 202 1332 2.16
AGX7_Performance 70.4 k 650 4072 6.92
AGX7_Performance_Giant 125.5 k 1546 5916 11.65

public/yolo-v3-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 33.6 k 186 1465 4.4 4 62.27 31.58
AGX7_FP16_Performance 103.6 k 1162 6336 27.9 28 62.25 31.58
AGX7_Generic 38.7 k 202 1919 8.3 8 62.28 31.49
AGX7_Performance 70.4 k 650 5797 25.3 11 62.22 31.47
AGX7_Performance_Giant 125.5 k 1546 9210 41.2 30 62.25 31.46

public/yolo-v3-tiny-tf

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Generic 33.6 k 186 1214 43 37 35.79 14.77
AGX7_FP16_Performance 103.6 k 1162 4542 116 114 35.81 14.78
AGX7_Generic 38.7 k 202 2361 83 67 35.76 14.74
AGX7_Performance 70.4 k 650 4244 109 38 35.73 14.72
AGX7_Performance_Giant 125.5 k 1546 5917 109 65 35.81 14.75

public/yolo-v8-nano detection

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Detection mAP @0.5 Detection mAP @0.5:0.95
AGX7_FP16_Performance 103.6 k 1162 6854 94 92 51.15 36.52
AGX7_Generic 38.7 k 202 2455 50 41 51.14 36.51
AGX7_Performance 70.4 k 650 6293 92 32 51.10 36.48

public/yolo-v8-nano classification

Architecture ALMs DSPs DDR4 [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Performance 103.6 k 1162 8087 1082 67.90 87.72
AGX7_Generic 38.7 k 202 5534 950 67.88 87.86
AGX7_Performance 70.4 k 650 9700 1294 67.70 87.66

public/squeezenet1.1

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps] IP+host Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 650 225 225 58.5 81.1
AGX7_FP16_Performance 103.6 k 1162 4828 970 924 58.5 81.1
AGX7_Small_NoSoftmax 17.0 k 80 924 220 220 58.5 81.0
AGX7_Small_Softmax 18.4 k 90 914 217 218 58.5 81.0
AGX7_Generic 38.7 k 202 1748 543 544 58.5 81.0
AGX7_Performance 70.4 k 650 4356 872 395 58.4 81.0
AGX7_Performance_Giant 125.5 k 1546 5755 953 650 58.3 81.1

public/i3d_rgb_tf

Architecture ALMs DSPs DDR4 [MB/s] Throughput [fps] Top-1 [%] Top-5 [%]
AGX7_FP16_Generic 33.6 k 186 453 0.62 65.79 82.89
AGX7_FP16_Performance 103.6 k 1162 2543 4.11 66.22 82.89
AGX7_Small_NoSoftmax 17.0 k 80 489 0.57 65.35 82.89
AGX7_Small_Softmax 18.4 k 90 483 0.57 65.57 83.11
AGX7_Generic 38.7 k 202 748 1.37 65.57 83.11
AGX7_Performance 70.4 k 650 2318 3.74 65.13 83.11
AGX7_Performance_Giant 125.5 k 1546 2968 4.62 65.79 82.89

ResNet50 V1

Architecture ALMs DSPs DDR4 [MB/s] IP Throughput [fps]
AGX5_Performance 83.6 k 266 4665.046 74.91

ResNet18 (PyTorch)

Architecture ALMs DSPs M20Ks IP Throughput [fps]
AGX7_Streaming_Ddrfree_Resnet18 77.7 k 296 8048 168

[1] Terasic* DE10-Agilex Development Board (DE10-Agilex-B2E2)
[2] Agilex™ 7 FPGA I-Series Development Kit ES2 (DK-DEV-AGI027RBES)
[3] Agilex™ 5 FPGA E-Series 065B Modular Development Kit (MK-A5E065BB32AES1)
* The DDR4 [MB/s] value is the estimated minimum average read + write bandwidth (that is, reads and writes together require at least this much bandwidth on average). Peak bandwidth is higher.
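
To compare memory traffic across architectures independently of their frame rates, you can divide the DDR4 bandwidth by the throughput to get an approximate figure in MB per frame. The following sketch assumes that the DDR4 [MB/s] value in each row corresponds to the IP Throughput [fps] value in the same row, which the tables suggest but do not state explicitly. The function name `ddr_mb_per_frame` is hypothetical.

```python
def ddr_mb_per_frame(ddr_mb_per_s: float, ip_fps: float) -> float:
    """Approximate average DDR traffic per inference, in MB (read + write)."""
    return ddr_mb_per_s / ip_fps


# (architecture, DDR4 [MB/s], IP Throughput [fps])
# Values copied from the public/resnet-50-tf table in this section.
rows = [
    ("AGX7_FP16_Generic", 3080, 32),
    ("AGX7_Generic", 4237, 61),
    ("AGX7_Performance", 11058, 156),
]

for arch, ddr, fps in rows:
    print(f"{arch}: about {ddr_mb_per_frame(ddr, fps):.1f} MB per frame")
```

Because the DDR4 values are estimated minimum averages, treat the per-frame result as a lower-bound estimate rather than a measurement.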