Visible to Intel only — GUID: ase1566243064294
Ixiasoft
Visible to Intel only — GUID: ase1566243064294
Ixiasoft
3.6.1. Load-Store Unit Types
- Burst-Coalesced Load-Store Units
- Prefetching Load-Store Units
- Pipelined Load-Store Units
- Constant-Pipelined Load-Store Units
- Atomic-Pipelined Load-Store Units
Burst-Coalesced Load-Store Units
A burst-coalesced LSU is the default LSU type instantiated by the compiler for accessing global memory. It buffers requests until the largest possible burst can be made. The burst-coalesced LSU can provide efficient access to global memory, but it requires a considerable amount of FPGA resources.
kernel void burst_coalesced (global int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
int value = in[i/2]; // Burst-coalesced LSU
out[i] = value;
}
Prefetching Load-Store Units
A prefetching LSU instantiates a FIFO which burst reads large blocks from memory to keep the FIFO full of valid data based on the previous address and assuming contiguous reads. Non-contiguous reads are supported, but a penalty is incurred to flush and refill the FIFO. A prefetching LSU is inferred only for non-volatile global pointers.
kernel void prefetching (global int * restrict in,
global int * restrict out,
int N) {
int res = 1;
for (int i = 0; i < N; i++) {
int v = in[i]; // Prefetching LSU
res ^= v;
}
out[0] = res;
}
Pipelined Load-Store Units
A pipelined LSU is used for accessing local memory. Requests are submitted as soon as they are received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a pipelined never-stall LSU is created.
__attribute((reqd_work_group_size(1024,1,1)))
kernel void local_pipelined (global int* restrict in,
global int* restrict out) {
local int lmem[1024];
int gi = get_global_id(0);
int li = get_local_id(0);
int res = in[gi];
for (int i = 0; i < 4; i++) {
lmem[li - i] = res; // pipelined LSU
res >>= 1;
}
barrier(CLK_GLOBAL_MEM_FENCE);
res = 0;
for (int i = 0; i < 4; i++) {
res ^= lmem[li - i]; // pipelined LSU
}
out[gi] = res;
}
The compiler may also infer a pipelined LSU for global memory accesses that can be proven to be infrequent. The compiler uses a pipelined LSU for such accesses because a pipelined LSU is smaller than other LSU types. While a pipelined LSU might have lower throughput, this throughput tradeoff is acceptable because memory accesses are infrequent.
kernel void global_infrequent (global int * restrict in,
global int * restrict out,
int N) {
int a = 0;
if (get_global_id(0) == 0)
a = in[0]; // Pipelined LSU
for (int i = 0; i < N; i++) {
out[i] = in[i] + a;
}
}
Constant-Pipelined Load-Store Units
A constant-pipelined LSU is a pipelined LSU that is used mainly to read from the constant cache. The constant-pipelined LSU consumes less area than a burst-coalesced LSU. The throughput of a constant-pipelined LSU depends greatly on whether the reads hit in the constant cache. Cache misses are expensive.
kernel void constant_pipelined (constant int *src,
global int *dst) {
int i = get_global_id(0);
dst[i] = src[i]; // Constant pipelined LSU
}
For information about the constant cache, see Constant Cache Memory.
Atomic-Pipelined Load-Store Units
An atomic-pipelined LSU is used for all atomic operations. Using atomic operations can significantly reduce kernel performance.
kernel void atomic_pipelined (global int* restrict out) {
atomic_add(&out[0], 1); // Atomic LSU
}