Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 10/04/2021
Public

8.2. Optimize Global Memory Accesses

The Intel® FPGA SDK for OpenCL™ Offline Compiler uses SDRAM as global memory. By default, the offline compiler configures global memory in a burst-interleaved configuration. The offline compiler interleaves global memory across each of the external memory banks.

In most circumstances, the default burst-interleaved configuration leads to the best load balancing between the memory banks. However, in some cases, you might want to partition the banks manually as two non-interleaved (and contiguous) memory regions to achieve better load balancing.

The figure below illustrates the differences in memory mapping patterns between burst-interleaved and non-interleaved memory partitions.

Figure 79. Global Memory Partitions
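
As a rough illustration of the two mapping patterns, the following Python sketch maps a byte address to a memory bank under each scheme. The function names, the two-bank layout, the 1 KiB burst size, and the 1 GiB bank size are all hypothetical values chosen for illustration; the actual burst size and bank count depend on your board.

```python
def bank_burst_interleaved(addr, num_banks=2, burst_bytes=1024):
    """Return the bank that holds byte address `addr` when consecutive
    bursts rotate across banks (burst_bytes is an assumed value)."""
    return (addr // burst_bytes) % num_banks


def bank_non_interleaved(addr, num_banks=2, bank_bytes=2**30):
    """Return the bank that holds byte address `addr` when each bank is
    one contiguous region (bank_bytes is an assumed value)."""
    bank = addr // bank_bytes
    if bank >= num_banks:
        raise ValueError("address lies beyond the last bank")
    return bank


if __name__ == "__main__":
    # Walk a 4 KiB buffer one burst at a time and show where each burst lands.
    for addr in range(0, 4096, 1024):
        print(addr, bank_burst_interleaved(addr), bank_non_interleaved(addr))
```

In this sketch, a sequential walk alternates between banks in the interleaved case, while in the non-interleaved case the whole buffer stays in a single bank unless it crosses a bank boundary.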


Global Memory Bandwidth Use

To check whether your kernel fully uses the global memory bandwidth listed in the board specification file, calculate the kernel's bandwidth use. The report.html file also displays the kernel bandwidth values in the global memory view of the System Viewer. The following formulas show how to calculate this value on a per-LSU basis:

Figure 80. Formulas for Calculating Kernel Bandwidth Use

The first equation gives the LSU bandwidth as the minimum of three bottleneck values (BW1, BW2, and BW3). The remaining equations define those three bottlenecks, any of which can limit the LSU bandwidth. These formulas give the theoretical maximum bandwidth that a single LSU can consume, ignoring all other LSUs. The actual bandwidth depends on the LSU's access pattern and on how the interconnect arbitrates between all LSUs. For an estimate of the overall bandwidth, the controller in the global memory view of the System Viewer shows the sum of the LSU bandwidths.

The following table describes the variables used in the above equations:

| Variable | Description |
|---|---|
| KWIDTH | Byte-width of the LSU on the kernel side. In the report.html file, it is referred to as WIDTH. |
| MWIDTH | Byte-width of the LSU facing the external memory. In the report.html file, it is referred to as <Memory Name>_Width. |
| FMAX | Clock speed of the kernel in MHz. In the report.html file, this is the design's clock speed. |
| MaxBandwidth | Maximum bandwidth (in MB/s) that the global memory can achieve. You can find this value in the board_spec.xml file for the specific global memory. |
| NUM_CHANNELS | Number of interfaces the external memory has. You can find this value by counting the interfaces listed in the board_spec.xml file under that memory. |
| NUM_INTERLEAVING_CHANNELS | Number of channels when interleaving is enabled. Otherwise, this value is 1. |
| BW1 | Bottleneck at the kernel boundary. BW1 uses only kernel values, that is, values you can change by optimizing the design. If BW1 limits the overall bandwidth use, changing your design can relieve the bottleneck at the kernel boundary. |
| BW2 | Bottleneck at the memory interface to the kernel. BW2 uses the width of the memory interface and FMAX, so either improving the FMAX of your design or switching to a board with a wider memory interface can improve the bandwidth use. |
| BW3 | Bottleneck in the external memory. BW3 uses external memory properties exclusively. If BW3 is the limiting factor, your design already uses the board bandwidth completely. |
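
The calculation described above can be sketched in Python as follows. The exact formulas appear in Figure 80, which is not reproduced in this text, so the three bottleneck expressions below are assumptions inferred only from the variable descriptions in the table (BW1 from kernel values, BW2 from the memory interface width and FMAX, BW3 from external memory properties). Check them against the figure before relying on the results. The function name and the input values in the usage example are hypothetical.

```python
def lsu_bandwidth(kwidth, mwidth, fmax_mhz, max_bandwidth,
                  num_channels, num_interleaving_channels):
    """Estimate one LSU's theoretical maximum bandwidth in MB/s.

    Widths are in bytes and FMAX is in MHz, so width * FMAX is in MB/s.
    Returns (bandwidth, name of the limiting bottleneck).
    """
    bottlenecks = {
        # ASSUMED form: kernel-side width times kernel clock.
        "BW1": kwidth * fmax_mhz,
        # ASSUMED form: memory-side width times kernel clock.
        "BW2": mwidth * fmax_mhz,
        # ASSUMED form: share of the memory bandwidth reachable through
        # the channels this LSU's accesses are interleaved across.
        "BW3": max_bandwidth * num_interleaving_channels / num_channels,
    }
    limiting = min(bottlenecks, key=bottlenecks.get)
    return bottlenecks[limiting], limiting


if __name__ == "__main__":
    # Hypothetical board: 4 channels, 76800 MB/s total, interleaving enabled.
    bw, limit = lsu_bandwidth(kwidth=32, mwidth=64, fmax_mhz=300,
                              max_bandwidth=76800, num_channels=4,
                              num_interleaving_channels=4)
    print(f"{bw} MB/s, limited by {limit}")
```

Under these assumed formulas, the hypothetical LSU is limited by BW1, which suggests that widening the kernel-side access would be the first thing to try. With interleaving disabled (NUM_INTERLEAVING_CHANNELS set to 1), BW3 drops to a single channel's share of the memory bandwidth.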