Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022
Document Table of Contents

3.6.2. Load-Store Unit Modifiers

Depending on the memory access pattern in your kernel, the compiler modifies some LSUs.


Burst-coalesced LSUs might sometimes include a cache. A cache is created when the memory access pattern is data-dependent or appears to be repetitive. The cache cannot be shared with other loads even if the loads want the same data. The cache is flushed on kernel start and consumes more hardware resources than an equivalent LSU without a cache. The cache is inferred only for non-volatile global pointers.

kernel void cached (global int * restrict in, 
                    global int * restrict out, 
                    int N) {
  int gid = get_global_id(0);
  for (int i = 0; i < N; i++) {
    out[N*gid + i] = in[i];

Write-Acknowledge (write-ack)

Burst-coalesced store LSUs sometimes require a write-acknowledgment signal when data dependencies exist. LSUs with a write-acknowledge signal require additional hardware resources. Throughput might be reduced if multiple write-acknowledge LSUs access the same memory.

kernel void write_ack (global int * restrict in, 
                       global int * restrict out, 
                       int N) {
  for (int i = 0; i < N; i++) {
    if (i < 2)
      out[i] = 0;            // Burst-coalesced write-ack LSU
    out[i] = in[i];


When a burst-coalesced LSU can access memory that is not aligned to the external memory word size, a nonaligned LSU is created. Additional hardware resources are required to implement a nonaligned LSU. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests.

kernel void non_aligned (global int * restrict in, 
                         global int * restrict out) {
  int i = get_global_id(0);
  // Three loads are statically coalesced into one, 
  // creating a burst-coalesced non-aligned LSU.
  int a1 = in[3*i+0];
  int a2 = in[3*i+1];
  int a3 = in[3*i+2];
  // Three stores statically coalesced into one, 
  // creating a burst-coalesced non-aligned LSU.
  out[3*i+0] = a3;
  out[3*i+1] = a2;
  out[3*i+2] = a1;


If a pipelined LSU is connected to a local memory without arbitration, a never-stall LSU is created because all accesses to the memory take a fixed number of cycles that are known to the compiler.

kernel void never_stall (global int* restrict in, 
                         global int* restrict out, 
                         int N) {
  local int lmem[1024];
  int gi = get_global_id(0);
  int li = get_local_id(0);

  lmem[li] = in[gi];                   // Pipelined never-stall LSU
  out[gi] = lmem[li] ^ lmem[li + 1];