Visible to Intel only — GUID: ahd1504718313058
Ixiasoft
Visible to Intel only — GUID: ahd1504718313058
Ixiasoft
3.3.1. Changing the Memory Access Pattern Example
kernel void big_lmem_4r_4w_nosplit (global int* restrict in,
global int* restrict out) {
local int lmem[4][1024];
int gi = get_global_id(0);
int gs = get_global_size(0);
int li = get_local_id(0);
int ls = get_local_size(0);
int res = in[gi];
#pragma unroll
for (int i = 0; i < 4; i++) {
lmem[i][(li*i) % ls] = res;
res >>= 1; }
// Global memory barrier
barrier(CLK_GLOBAL_MEM_FENCE);
res = 0;
#pragma unroll
for (int i = 0; i < 4; i++) {
res ^= lmem[i][((ls-li)*i) % ls]; }
out[gi] = res;
}
In the System Viewer report, the system view of this example highlights the stallable loads and stores.
Observe that only two memory banks are created, with high arbitration on the first bank between load and store operations. Now, switch the banking indices to the second dimension, as shown in the following example code, :
kernel void big_lmem_4r_4w_nosplit (global int* restrict in,
global int* restrict out) {
local int lmem[1024][4];
int gi = get_global_id(0);
int gs = get_global_size(0);
int li = get_local_id(0);
int ls = get_local_size(0);
int res = in[gi];
#pragma unroll
for (int i = 0; i < 4; i++) {
lmem[(li*i) % ls][i] = res;
res >>= 1;
}
// Global memory barrier
barrier(CLK_GLOBAL_MEM_FENCE);
res = 0;
#pragma unroll
for (int i = 0; i < 4; i++) {
res ^= lmem[((ls-li)*i) % ls][i];
}
out[gi] = res;
}
In the kernel memory viewer, you can observe that now four memory banks are created, with separate load store units. All load store instructions are stall-free.
Did you find the information on this page useful?
Feedback Message
Characters remaining: