Perform Kernel Computations Using Local or Private Memory
To optimize memory access efficiency, minimize the number of global memory accesses by performing kernel computations in local or private memory.
In practice, it is often best to preload the data for a group of computations from global memory into local or private memory, perform the kernel computations on the preloaded data, and then write the results back to global memory.
Preload Data into Local Memory or Private Memory
Local memory is considerably smaller than global memory, but it has significantly higher bandwidth and much lower latency. Unlike global memory, local memory can be accessed randomly by the kernel without any performance penalty. When you structure your kernel code, try to access global memory sequentially, and buffer that data in on-chip local memory before the kernel uses it for computation.
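The preload, compute, and write-back steps can be sketched in plain C++ as a host-side model of the pattern. This is an illustration under stated assumptions, not compiler-specific kernel code: the function name mirror_add_tiled, the tile size kTile, and the mirror-add computation are hypothetical, and the std::array named tile only stands in for on-chip local memory. In a real kernel, the same loop structure would sit inside the kernel body.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size; in practice, choose it to fit the on-chip memory budget.
constexpr std::size_t kTile = 64;

// Adds each element to its mirror element within the same tile:
//   out[base + i] = in[base + i] + in[base + (kTile - 1 - i)]
// The vectors model global memory; the std::array models local memory.
void mirror_add_tiled(const std::vector<float>& in, std::vector<float>& out) {
  assert(in.size() == out.size());
  assert(in.size() % kTile == 0);

  for (std::size_t base = 0; base < in.size(); base += kTile) {
    std::array<float, kTile> tile;  // stand-in for on-chip local memory

    // Step 1: preload the tile with sequential reads from "global" memory.
    for (std::size_t i = 0; i < kTile; ++i) {
      tile[i] = in[base + i];
    }

    // Step 2: compute on the preloaded data. The mirrored read is
    // non-sequential, but it only touches the local copy.
    // Step 3: write the results back to "global" memory sequentially.
    for (std::size_t i = 0; i < kTile; ++i) {
      out[base + i] = tile[i] + tile[kTile - 1 - i];
    }
  }
}
```

Each input element is read from the modeled global memory exactly once, even though the computation uses it twice; the second, out-of-order use is served from the local tile.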
Store Variables and Arrays in Private Memory
The Intel® oneAPI DPC++/C++ Compiler implements private memory using FPGA registers in the kernel datapath, block RAMs, or MLABs. The compiler analyzes the private memory accesses and promotes them to register accesses. Scalar variables, for example float, int, and char, are typically promoted. Aggregate data types are promoted if array-access indices are compile-time constants. Typically, private memory is useful for storing single variables or small arrays. Registers are plentiful hardware resources in FPGAs, and it is usually better to use private memory instead of other memory types whenever possible. The kernel can access private memories in parallel, allowing them to provide more bandwidth than any other memory type (global and local).
For more information on the implementation of private memory using registers, refer to Inferring a Shift Register.
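As a rough illustration of private-memory usage, the following plain C++ sketch keeps a small sliding window and a scalar accumulator in variables that would be private to the kernel. The names fir_private and kTaps and the filter itself are hypothetical. The inner loops have a constant trip count, so once they are fully unrolled (for example with an unroll pragma in kernel code, assuming the compiler supports one), every index into window is a compile-time constant, which is the condition described above for promoting an array to registers. Whether promotion actually occurs depends on the compiler.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical filter length; small enough for the window to fit in registers.
constexpr std::size_t kTaps = 4;

// Computes out[n] = sum over k of coeff[k] * in[n - k], treating in[m] as 0 for m < 0.
// The vectors model global memory; window and acc model private memory.
void fir_private(const std::vector<int>& in, std::vector<int>& out,
                 const std::array<int, kTaps>& coeff) {
  assert(in.size() == out.size());

  std::array<int, kTaps> window{};  // small private array, zero-initialized

  for (std::size_t n = 0; n < in.size(); ++n) {
    // Shift the window by one position. The trip count is constant, so
    // unrolling this loop leaves only constant indices.
    for (std::size_t k = kTaps - 1; k > 0; --k) {
      window[k] = window[k - 1];
    }
    window[0] = in[n];  // one sequential read from "global" memory per sample

    int acc = 0;  // scalar, a typical candidate for register promotion
    for (std::size_t k = 0; k < kTaps; ++k) {
      acc += coeff[k] * window[k];
    }
    out[n] = acc;
  }
}
```

With in = {1, 2, 3, 4, 5} and all four coefficients set to 1, the function produces the running four-sample sums {1, 3, 6, 10, 14}.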