Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 6/21/2022
Public

8.4. Improving Kernel Performance by Banking the Local Memory

Specifying the numbanks(N) and bankwidth(M) advanced kernel attributes allows you to configure the local memory banks for parallel memory accesses. The banking geometry described by these advanced kernel attributes determines which elements of the local memory system your kernel can access in parallel.

The following code example shows an 8 x 4 local memory system that is implemented in a single bank. Because every element resides in the same bank, no two elements in the system can be accessed in parallel.

local int lmem[8][4];

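// x is a dynamic (run-time) index into the lower array dimension.
// Unrolling produces two stores, one to row 0 and one to row 2.
#pragma unroll
for (int i = 0; i < 4; i += 2) {
    lmem[i][x] = …;
}

Figure 82. Serial Accesses to an 8 x 4 Local Memory System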

To improve performance, you can add numbanks(N) and bankwidth(M) to your code to define the number of memory banks and the width of each bank in bytes. The following code implements eight memory banks, each 16 bytes wide. Each row of four int elements is 16 bytes, so this memory bank configuration enables parallel memory accesses down the 8 x 4 array, that is, to elements in different rows.

local int __attribute__((numbanks(8),
                         bankwidth(16)))
                         lmem[8][4];

#pragma unroll
for (int i = 0; i < 4; i += 2) {
    lmem[i][x & 0x3] = …;
}

Attention:

To enable parallel access, you must mask the dynamic access on the lower (rightmost) array index. The mask x & 0x3 informs the Intel® FPGA SDK for OpenCL™ Offline Compiler that x stays within the bounds of the lower index, which are 0 to 3 for this array.

Figure 83. Parallel Access to an 8 x 4 Local Memory System with Eight 16-Byte-Wide Memory Banks
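
The effect of this geometry, and of the mask, can be checked with a small host-side model in plain C. This is a sketch rather than SDK code. It assumes that consecutive bankwidth-sized words are interleaved across the banks, so that bank = (byte offset / bankwidth) mod numbanks, and that int is 4 bytes as in OpenCL. The names bank_of and store_bank are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 8, COLS = 4, ELEM_BYTES = 4 };  /* models local int lmem[8][4] */

/* Assumed mapping: consecutive bankwidth-byte words rotate through the banks. */
static unsigned bank_of(size_t byte_offset, unsigned numbanks, unsigned bankwidth)
{
    assert(numbanks > 0 && bankwidth > 0);
    return (unsigned)((byte_offset / bankwidth) % numbanks);
}

/* Bank touched by a store to lmem[row][col] with numbanks(8), bankwidth(16).
   col is deliberately not range-checked, to show what an unmasked index can do. */
static unsigned store_bank(unsigned row, unsigned col)
{
    size_t offset = ((size_t)row * COLS + col) * ELEM_BYTES;
    return bank_of(offset, 8, 16);
}
```

Under this model, store_bank(0, x & 0x3) is always 0 and store_bank(2, x & 0x3) is always 2, so the two unrolled stores never target the same bank. Without the mask, a value such as x = 9 would place lmem[0][x] at byte offset 36, which the model maps to bank 2, the same bank that holds row 2. That possibility is what the mask rules out.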

By specifying different values for the numbanks(N) and bankwidth(M) kernel attributes, you can change the parallel access pattern. The following code implements four memory banks, each 4 bytes wide. This memory bank configuration enables parallel memory accesses across the 8 x 4 array, that is, to elements in different columns.

local int __attribute__((numbanks(4),
                         bankwidth(4)))
                         lmem[8][4];

#pragma unroll
for (int i = 0; i < 4; i += 2) {
    lmem[x][i] = …;
}

Figure 84. Parallel Access to an 8 x 4 Local Memory System with Four 4-Byte-Wide Memory Banks
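
A host-side model in plain C can illustrate why four 4-byte banks favor accesses across a row. This is a sketch under the assumption that bankwidth-sized words are interleaved across the banks, so that bank = (byte offset / bankwidth) mod numbanks, with a 4-byte int as in OpenCL. The function name element_bank is hypothetical and is not part of the SDK.

```c
#include <assert.h>
#include <stddef.h>

enum { COLS = 4, ELEM_BYTES = 4 };  /* models local int lmem[8][4] */

/* Bank holding lmem[row][col] under numbanks(4), bankwidth(4),
   using the assumed interleaved mapping described above. */
static unsigned element_bank(unsigned row, unsigned col)
{
    const unsigned numbanks = 4, bankwidth = 4;
    size_t offset = ((size_t)row * COLS + col) * ELEM_BYTES;
    assert(col < COLS);  /* this model expects an in-range column index */
    return (unsigned)((offset / bankwidth) % numbanks);
}
```

Under this model, element_bank(x, i) equals i for every row x, so the unrolled stores to lmem[x][0] and lmem[x][2] land in banks 0 and 2 whatever value x takes at run time. Two elements in the same column, such as lmem[0][1] and lmem[2][1], share bank 1 and therefore cannot be served by different banks.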