Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 6/21/2022
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

3.6.1. Load-Store Unit Types

The compiler can generate several different types of load-store units (LSUs) based on the inferred memory access pattern, the types of memory available on the target platform, and whether the memory accesses are to local or global memory. The Intel® FPGA SDK for OpenCL™ Offline Compiler can generate the following types of LSU:

Burst-Coalesced Load-Store Units

A burst-coalesced LSU is the default LSU type instantiated by the compiler for accessing global memory. It buffers requests until the largest possible burst can be made. The burst-coalesced LSU can provide efficient access to global memory, but it requires a considerable amount of FPGA resources.

kernel void burst_coalesced (global int * restrict in, 
                             global int * restrict out) {
  int i = get_global_id(0);
  int value = in[i/2];      // Burst-coalesced LSU
  out[i] = value;
}
Depending on the memory access pattern and other attributes, the compiler might modify a burst-coalesced LSU in the following ways:

Prefetching Load-Store Units

A prefetching LSU instantiates a FIFO which burst reads large blocks from memory to keep the FIFO full of valid data based on the previous address and assuming contiguous reads. Non-contiguous reads are supported, but a penalty is incurred to flush and refill the FIFO. A prefetching LSU is inferred only for non-volatile global pointers.

kernel void prefetching (global int * restrict in, 
                         global int * restrict out, 
                         int N) {
  int res = 1;
  for (int i = 0; i < N; i++) {
    int v = in[i];             // Prefetching LSU
    res ^= v;
  }
  out[0] = res;
}

Pipelined Load-Store Units

A pipelined LSU is used for accessing local memory. Requests are submitted as soon as they are received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a pipelined never-stall LSU is created.

__attribute((reqd_work_group_size(1024,1,1)))
kernel void local_pipelined (global int* restrict in, 
                             global int* restrict out) {
  local int lmem[1024];
  int gi = get_global_id(0);
  int li = get_local_id(0);

  int res = in[gi];
  for (int i = 0; i < 4; i++) {
    lmem[li - i] = res;                 // pipelined LSU
    res >>= 1;
  }

  barrier(CLK_GLOBAL_MEM_FENCE);

  res = 0;
  for (int i = 0; i < 4; i++) {
    res ^= lmem[li - i];                // pipelined LSU
  }

  out[gi] = res;
}
The compiler might modify a local-pipelined LSU in the following way:

The compiler may also infer a pipelined LSU for global memory accesses that can be proven to be infrequent. The compiler uses a pipelined LSU for such accesses because a pipelined LSU is smaller than other LSU types. While a pipelined LSU might have lower throughput, this throughput tradeoff is acceptable because memory accesses are infrequent.

kernel void global_infrequent (global int * restrict in, 
                               global int * restrict out, 
                               int N) {
  int a = 0;
  if (get_global_id(0) == 0)
      a = in[0];                   // Pipelined LSU
  for (int i = 0; i < N; i++) {
    out[i] = in[i] + a;
  }
}

Constant-Pipelined Load-Store Units

A constant-pipelined LSU is a pipelined LSU that is used mainly to read from the constant cache. The constant-pipelined LSU consumes less area than a burst-coalesced LSU. The throughput of a constant-pipelined LSU depends greatly on whether the reads hit in the constant cache. Cache misses are expensive.

 kernel void constant_pipelined (constant int *src, 
                                 global int *dst) {
  int i = get_global_id(0);
  dst[i] = src[i];              // Constant pipelined LSU
}

For information about the constant cache, see Constant Cache Memory.

Atomic-Pipelined Load-Store Units

An atomic-pipelined LSU is used for all atomic operations. Using atomic operations can significantly reduce kernel performance.

kernel void atomic_pipelined (global int* restrict out) {
  atomic_add(&out[0], 1);  // Atomic LSU
}