Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022
Public
Document Table of Contents

10.4. On-Chip Storage of Reused Data

For Intel® Stratix® 10 designs, the presence of cached LSUs prevent certain optimizations. Intel® recommends that you avoid inferring caching LSUs.

The unoptimized code below creates a cached burst-coalesced LSU that consumes more resources and disables other optimizations:

kernel void cached (global int * restrict in,
                    global int * restrict out) {
   int i = get_global_id(0);
   int idx = out[i];
   int cached_value = in[idx]; // Burst-coalesced cached LSU
   out[i] = cached_value;
}

To prevent the caching, mark the memory pointer as volatile, as shown in the optimized code below:

kernel void not_cached (global volatile int * restrict in,
                        global int * restrict out) {
   int i = get_global_id(0);
   int idx = out[i];
   int not_cached_value = in[idx];
   out[i] = not_cached_value;
}

For more information about optimizing load-store units, refer to the Load-Store Units section.

Intel® also recommends that you use on-chip storage to achieve optimal results. Note that compared to non- Intel® Stratix® 10 devices, Intel® Stratix® 10 has a larger M20K to ALM ratio, allowing you to create larger local memory systems.

The unoptimized code below has a function that receives a pointer from the array in global memory. In this case, the offline compiler modifies the array and then stores it back to memory. Then, in a subsequent iteration of the outer loop, the array is reused.

void cached_array(int N, int M, int BUF_SIZE, 
                  global float2 * global_buf)
{
   for (uint i = 0; i< N; i++) {
      float2 data_array[BUF_SIZE];
      for (uint j= 0; j < M; j++) {
	          
         data_array[i] = global_buf [j]; //Load value
		 
         ... //do work to modify data_array 
		 
         global_buf[j] = data_array[i]; 
      } 
   } 
}

To prevent unnecessary global memory accesses, define a private array in the kernel to declare the array on chip. The function accesses the private array instead of the one declared on chip. As a result, the array receives on-chip local memory storage. Accessing this on-chip local memory does not require accesses to global memory.

Optimized code:

void local_array(int N, int M, int BUF_SIZE, 
                 global float2 * global_buf)
{
   float2 local_buf[BUF_SIZE];
   populate_buf(local_buf, global_buf); 

   for (uint i = 0; i< N; i++) {
      float2	data_array[BUF_SIZE];			
      for (uint j= 0; j < M; j++) {

         data_array[i] = local_buf[j];//Load value 

		 ... //do work to modify data_array

         local_buf[j] = data_array[i]; 

      }
   } 
}