Visible to Intel only — GUID: uir1520273525684
Ixiasoft
Visible to Intel only — GUID: uir1520273525684
Ixiasoft
10.4. On-Chip Storage of Reused Data
The unoptimized code below creates a cached burst-coalesced LSU that consumes more resources and disables other optimizations:
kernel void cached (global int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
int idx = out[i];
int cached_value = in[idx]; // Burst-coalesced cached LSU
out[i] = cached_value;
}
To prevent the caching, mark the memory pointer as volatile, as shown in the optimized code below:
kernel void not_cached (global volatile int * restrict in,
global int * restrict out) {
int i = get_global_id(0);
int idx = out[i];
int not_cached_value = in[idx];
out[i] = not_cached_value;
}
For more information about optimizing load-store units, refer to the Load-Store Units section.
Intel® also recommends that you use on-chip storage to achieve optimal results. Note that compared to non- Intel® Stratix® 10 devices, Intel® Stratix® 10 has a larger M20K to ALM ratio, allowing you to create larger local memory systems.
The unoptimized code below has a function that receives a pointer from the array in global memory. In this case, the offline compiler modifies the array and then stores it back to memory. Then, in a subsequent iteration of the outer loop, the array is reused.
void cached_array(int N, int M, int BUF_SIZE,
global float2 * global_buf)
{
for (uint i = 0; i< N; i++) {
float2 data_array[BUF_SIZE];
for (uint j= 0; j < M; j++) {
data_array[i] = global_buf [j]; //Load value
... //do work to modify data_array
global_buf[j] = data_array[i];
}
}
}
To prevent unnecessary global memory accesses, define a private array in the kernel to declare the array on chip. The function accesses the private array instead of the one declared on chip. As a result, the array receives on-chip local memory storage. Accessing this on-chip local memory does not require accesses to global memory.
Optimized code:
void local_array(int N, int M, int BUF_SIZE,
global float2 * global_buf)
{
float2 local_buf[BUF_SIZE];
populate_buf(local_buf, global_buf);
for (uint i = 0; i< N; i++) {
float2 data_array[BUF_SIZE];
for (uint j= 0; j < M; j++) {
data_array[i] = local_buf[j];//Load value
... //do work to modify data_array
local_buf[j] = data_array[i];
}
}
}