High Bandwidth Memory (HBM2E) Interface Intel Agilex® 7 M-Series FPGA IP Design Example User Guide

ID 773266
Date 12/04/2023
Public


3.4. Using the HBM2E Design Example for Performance Testing

You can use the HBM2E design example for performance testing.

Fabric NoC Option

For purely sequential workloads such as those included in the design example, a 256-bit interface needs to operate at very high fabric frequencies to match the sequential throughput of an HBM2E pseudo-channel. System-level performance is affected by the fabric NoC option, initiator-to-target NoC mapping, initiator placement, and clock frequencies that you choose.

Note: The abstract NoC simulation model is not currently suitable for performance testing. Accurate performance figures can only be obtained by running the design example on hardware.

Choose the 512-bit wide data paths option to maximize system performance for both sequential reads and writes. When you choose this option, you must specify a clock frequency for the NoC initiator bridge hardware; set that frequency as high as possible. Because the data paths are 512 bits wide, the core clock frequency can be lower: for -1 and -2 devices it must be at least 350 MHz, and for -3 devices it must be at least 250 MHz.

If you are only interested in measuring read throughput, choose the 256-bit write and 512-bit read data paths option, in which the read data path has its own clock. The traffic generators and write data path operate at the core clock frequency, which limits overall throughput. The 512-bit wide read data path can run at any frequency from 350 MHz upwards (for -1 and -2 devices).
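The width and frequency figures in this section can be related with simple arithmetic: the peak throughput of a data path is its width in bytes multiplied by its clock frequency. The following Python sketch is illustrative only. The function names are hypothetical, and the results are upper bounds that assume one beat per clock cycle and ignore NoC and memory controller efficiency.

```python
def peak_gbytes_per_s(width_bits: int, clock_mhz: float) -> float:
    """Upper-bound throughput of a data path moving one beat per clock cycle."""
    bytes_per_beat = width_bits // 8
    return bytes_per_beat * clock_mhz / 1000  # MB/s -> GB/s (decimal units)


def clock_mhz_to_match(width_bits: int, target_gbytes_per_s: float) -> float:
    """Clock frequency a data path of the given width needs to reach a target."""
    return target_gbytes_per_s * 1000 / (width_bits // 8)


# Minimum core clock frequencies quoted in this section for 512-bit data paths.
for grade, mhz in (("-1/-2", 350), ("-3", 250)):
    wide = peak_gbytes_per_s(512, mhz)
    narrow_mhz = clock_mhz_to_match(256, wide)
    print(f"{grade}: 512-bit @ {mhz} MHz = {wide:.1f} GB/s peak; "
          f"a 256-bit path would need {narrow_mhz:.0f} MHz to match")
```

Under these assumptions, a 256-bit path needs twice the clock frequency of a 512-bit path to reach the same peak throughput, which is why the 256-bit interface requires very high fabric frequencies.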

Initiator to Target NoC Mapping

You must choose an initiator-to-target NoC mapping of 1-to-1 full address connection to ensure that each HBM2E pseudo-channel receives purely sequential accesses. When you choose the 16-to-16 cross-bar connection option, each pseudo-channel receives commands from all 16 initiators, resulting in an overall random traffic pattern.

Note: Random-access workloads significantly reduce memory controller efficiency, and therefore the fabric NoC is not required to saturate the throughput of an HBM2E pseudo-channel with this kind of workload.
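The effect of the mapping choice on the access pattern seen by a pseudo-channel can be illustrated with a toy model. The Python sketch below is not a model of the actual NoC: the round-robin interleaving, the 64-byte stride, the per-initiator address regions, and all function names are assumptions chosen only to show how interleaving several sequential streams destroys sequentiality at the target.

```python
def sequential_fraction(addresses, stride):
    """Fraction of consecutive commands whose address follows the previous one."""
    pairs = list(zip(addresses, addresses[1:]))
    hits = sum(1 for prev, cur in pairs if cur == prev + stride)
    return hits / len(pairs)


def initiator_stream(index, count, stride=64, region=1 << 20):
    """Sequential addresses issued by one initiator within its own region."""
    base = index * region
    return [base + n * stride for n in range(count)]


def one_to_one(target, count):
    """1-to-1 mapping: the target sees only its own initiator's stream."""
    return initiator_stream(target, count)


def crossbar(num_initiators, count):
    """Crossbar (toy): the target sees all streams interleaved round-robin."""
    streams = [initiator_stream(i, count) for i in range(num_initiators)]
    return [addr for beat in zip(*streams) for addr in beat]


print("1-to-1  :", sequential_fraction(one_to_one(target=3, count=32), stride=64))
print("crossbar:", sequential_fraction(crossbar(num_initiators=16, count=32), stride=64))
```

In this simplified model, every command in the 1-to-1 case follows the previous address, while none does in the interleaved case, even though each individual initiator still issues purely sequential traffic.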

Initiator Placement

The placement of initiators affects core Fmax. For best performance, use explicit location assignments that make it easy for the Intel® Quartus® Prime software to spread the traffic generator logic across the device. Placing half of the initiators in each of the left and right halves of the device also minimizes the possibility of NoC congestion.

You can use the following placement of the design example initiators to get a good combination of Fmax and throughput when your design uses the top NoC. These assignments apply to configurations in which all HBM2E channels are enabled with a 256-bit data mode and a shared AXI4-Lite is not configured.

set_location_assignment NOCINITIATOR_X53_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_11|initiator_inst_0
set_location_assignment NOCINITIATOR_X79_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_15|initiator_inst_0
set_location_assignment NOCINITIATOR_X94_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_9|initiator_inst_0
set_location_assignment NOCINITIATOR_X105_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_13|initiator_inst_0
set_location_assignment NOCINITIATOR_X134_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_10|initiator_inst_0
set_location_assignment NOCINITIATOR_X150_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_14|initiator_inst_0
set_location_assignment NOCINITIATOR_X161_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_8|initiator_inst_0
set_location_assignment NOCINITIATOR_X188_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_12|initiator_inst_0
set_location_assignment NOCINITIATOR_X215_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_3|initiator_inst_0
set_location_assignment NOCINITIATOR_X242_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_7|initiator_inst_0
set_location_assignment NOCINITIATOR_X258_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_1|initiator_inst_0
set_location_assignment NOCINITIATOR_X269_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_5|initiator_inst_0
set_location_assignment NOCINITIATOR_X296_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_2|initiator_inst_0
set_location_assignment NOCINITIATOR_X311_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_6|initiator_inst_0
set_location_assignment NOCINITIATOR_X322_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_0|initiator_inst_0
set_location_assignment NOCINITIATOR_X357_Y417_N204 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_4|initiator_inst_0

This placement aims to place each initiator of the 1-to-1 connection topology close to the NoC target used to access the corresponding HBM2E pseudo-channel.
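Before adding the assignments to a project, it can be useful to sanity-check them. The following Python sketch parses the assignment lines listed above and checks that each of the 16 initiators is placed exactly once and that the placements split evenly by X coordinate. The regular expression and helper name are hypothetical, and treating a lower X coordinate as one side of the device is an assumption about the coordinate system.

```python
import re

QSF_TEXT = """\
set_location_assignment NOCINITIATOR_X53_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_11|initiator_inst_0
set_location_assignment NOCINITIATOR_X79_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_15|initiator_inst_0
set_location_assignment NOCINITIATOR_X94_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_9|initiator_inst_0
set_location_assignment NOCINITIATOR_X105_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_13|initiator_inst_0
set_location_assignment NOCINITIATOR_X134_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_10|initiator_inst_0
set_location_assignment NOCINITIATOR_X150_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_14|initiator_inst_0
set_location_assignment NOCINITIATOR_X161_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_8|initiator_inst_0
set_location_assignment NOCINITIATOR_X188_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_12|initiator_inst_0
set_location_assignment NOCINITIATOR_X215_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_3|initiator_inst_0
set_location_assignment NOCINITIATOR_X242_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_7|initiator_inst_0
set_location_assignment NOCINITIATOR_X258_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_1|initiator_inst_0
set_location_assignment NOCINITIATOR_X269_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_5|initiator_inst_0
set_location_assignment NOCINITIATOR_X296_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_2|initiator_inst_0
set_location_assignment NOCINITIATOR_X311_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_6|initiator_inst_0
set_location_assignment NOCINITIATOR_X322_Y417_N202 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_0|initiator_inst_0
set_location_assignment NOCINITIATOR_X357_Y417_N204 -to noc_initiator_with_wstrb|noc_initiator_with_wstrb|iniu_4|initiator_inst_0
"""

# Captures the X, Y, and N fields of the location and the iniu_<n> index.
PATTERN = re.compile(
    r"set_location_assignment NOCINITIATOR_X(\d+)_Y(\d+)_N(\d+) -to \S+\|iniu_(\d+)\|\S+"
)


def parse_assignments(text):
    """Return (x, y, initiator_index) tuples, one per assignment line."""
    return [(int(m[1]), int(m[2]), int(m[4])) for m in PATTERN.finditer(text)]


placements = parse_assignments(QSF_TEXT)
indices = sorted(index for _, _, index in placements)
by_x = sorted(placements)  # ascending X coordinate
lower_x_half = {index for _, _, index in by_x[:8]}
upper_x_half = {index for _, _, index in by_x[8:]}

print("every initiator placed once:", indices == list(range(16)))
print("lower-X half holds iniu:", sorted(lower_x_half))
print("upper-X half holds iniu:", sorted(upper_x_half))
```

A check like this catches duplicated or missing initiator indices when the assignments are adapted to a different configuration.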

Clock Frequency Selection

When you configure the fabric NoC to operate with an independent clock for the NoC bridge hardware, or a separate clock for the read data path, the additional clock is generated by the same PLL that provides the core clock. The PLL imposes a relationship between these clock frequencies. Because the NoC initiator interface is only 256 bits wide, its frequency has the biggest impact on system-level throughput.

When you choose the 512-bit wide data paths fabric NoC option, the NoC initiator hardware in the design example can be clocked at its highest allowable frequency. A variety of core clock frequencies are compatible with this frequency. The following table lists a subset of the core clock frequencies that are compatible with the recommended NoC bridge hardware clock frequency for each of the device speed grades. The core clock frequency recommended for easy timing closure of the design example is highlighted in bold.

Table 7.  Compatible Core Clock Frequencies
Device Speed Grade   NoC Bridge Hardware Clock Frequency   Core Clock Frequencies
-1                   660 MHz                               528 MHz, 495 MHz, 440 MHz, 396 MHz
-2                   630 MHz                               504 MHz, 450 MHz, 420 MHz, 378 MHz
-3                   450 MHz                               375 MHz, 350 MHz, 300 MHz, 250 MHz
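The relationship between the two clocks can be made visible by reducing each bridge-to-core frequency pair in Table 7 to a ratio of small integers. The following Python sketch does this and also checks whether the 512-bit core data path can carry at least the peak rate of the 256-bit NoC initiator interface. The names are hypothetical, and reading the ratios as a consequence of PLL settings is an interpretation, since the exact PLL constraints are not given here.

```python
from fractions import Fraction

# Values copied from Table 7: speed grade -> (bridge MHz, core MHz options).
TABLE_7 = {
    "-1": (660, (528, 495, 440, 396)),
    "-2": (630, (504, 450, 420, 378)),
    "-3": (450, (375, 350, 300, 250)),
}

NOC_WIDTH_BITS = 256   # NoC initiator interface width stated in this section
CORE_WIDTH_BITS = 512  # core data path width for the 512-bit option


def clock_ratio(bridge_mhz, core_mhz):
    """Bridge-to-core frequency ratio reduced to lowest terms."""
    return Fraction(bridge_mhz, core_mhz)


def core_keeps_up(bridge_mhz, core_mhz):
    """True if the 512-bit core path can carry the 256-bit NoC peak rate."""
    return CORE_WIDTH_BITS * core_mhz >= NOC_WIDTH_BITS * bridge_mhz


for grade, (bridge, cores) in TABLE_7.items():
    for core in cores:
        print(f"{grade}: bridge {bridge} MHz, core {core} MHz, "
              f"ratio {clock_ratio(bridge, core)}, "
              f"core keeps up: {core_keeps_up(bridge, core)}")
```

Every listed core clock frequency is more than half of the corresponding bridge frequency, so under this simple bits-per-second comparison the wider core data path is not the limiting factor.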