The pipeline-parallel nature of DPC++ execution on FPGAs means that memory loads and stores in your DPC++ code compete for access to memory resources (global, local, and private memories). If your DPC++ kernel performs a large number of memory accesses, the compiler must generate arbitration logic to share the available memory bandwidth among the memory access sites in your kernel's datapath. If the bandwidth demanded by the datapath exceeds what the memory and arbitration logic can provide, the datapath stalls. Stalls degrade the kernel's throughput because the compute pipeline must wait for a memory access to complete before resuming.
When optimizing your design, it is important to understand whether your DPC++ kernel's throughput is limited by memory accesses (a memory-bound kernel) or by the structure of the kernel datapath (a compute-bound kernel). These situations require different optimization techniques. The following sections discuss memory access optimization in detail.
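As a hedged illustration of how the number of memory access sites affects the generated hardware, the sketch below contrasts two versions of a simple 3-tap stencil kernel. The first version issues three separate global loads per loop iteration, each of which is a distinct access site competing for memory bandwidth; the second stages data through a small private shift register so the datapath performs only one global load per iteration. All kernel and variable names here are hypothetical, and compiling for an FPGA target requires a SYCL toolchain such as Intel oneAPI `icpx` (the code uses only standard SYCL 2020 constructs).

```cpp
#include <sycl/sycl.hpp>
#include <vector>

constexpr int N = 1024;

int main() {
  // Default device selection; a real FPGA flow would pass an FPGA selector.
  sycl::queue q;

  std::vector<float> in(N, 1.0f), out(N, 0.0f);
  {
    sycl::buffer<float> in_buf(in.data(), sycl::range<1>(N));
    sycl::buffer<float> out_buf(out.data(), sycl::range<1>(N));

    // Version 1 (hypothetical): three global loads per iteration.
    // Each of a[i-1], a[i], and a[i+1] is a separate access site, so the
    // compiler must arbitrate among them for memory bandwidth.
    q.submit([&](sycl::handler &h) {
      sycl::accessor a(in_buf, h, sycl::read_only);
      sycl::accessor r(out_buf, h, sycl::write_only, sycl::no_init);
      h.single_task([=] {
        for (int i = 1; i < N - 1; i++)
          r[i] = a[i - 1] + a[i] + a[i + 1];
      });
    });

    // Version 2 (hypothetical): a private shift register holds the three
    // most recent elements, leaving a single global load site per
    // iteration and removing the arbitration among the stencil taps.
    q.submit([&](sycl::handler &h) {
      sycl::accessor a(in_buf, h, sycl::read_only);
      sycl::accessor r(out_buf, h, sycl::write_only, sycl::no_init);
      h.single_task([=] {
        float win[3] = {0.0f, a[0], a[1]};
        for (int i = 1; i < N - 1; i++) {
          // Shift the window and load exactly one new element.
          win[0] = win[1];
          win[1] = win[2];
          win[2] = a[i + 1];
          r[i] = win[0] + win[1] + win[2];
        }
      });
    });
  }
  return 0;
}
```

Both versions compute the same result; the difference is in the generated datapath. Reducing the number of access sites to a given memory, as in the second version, is one common technique for relieving a memory-bound kernel.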