Intel® FPGA SDK for OpenCL™ Pro Edition: Programming Guide

ID 683846
Date 3/28/2022
Public

5.2.14. Specifying the private_copies Memory Attribute

You can apply the private_copies memory attribute to a variable declaration inside an OpenCL kernel as follows:
int __attribute__((private_copies(k))) local_A[M];

where k is an unsigned integer. When this attribute is applied to a variable declared or accessed inside a pipelined loop, the Intel® FPGA SDK for OpenCL™ Offline Compiler creates k independent copies of the memory implementing this variable. This allows up to k iterations of the pipelined loop to run in parallel, with each iteration accessing its own copy of the memory. If this attribute is not applied, or if k is set to 0, the compiler chooses an appropriate number of copies, up to a maximum of 16, to maximize throughput.
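As a minimal sketch of where the declaration sits relative to the pipelined loop, consider the following. The PRIVATE_COPIES macro, the sum_of_scaled_rows function, the scratch array, and the value 2 are hypothetical names and values chosen for illustration; only the attribute spelling comes from this section. The macro expands to nothing outside an OpenCL compiler so that the same text can be checked as plain C.

```c
/* Hypothetical wrapper macro (not part of the SDK): expands to the
   private_copies attribute under an OpenCL compiler, and to nothing
   elsewhere so the sketch also builds as plain C. */
#ifdef __OPENCL_VERSION__
#define PRIVATE_COPIES(k) __attribute__((private_copies(k)))
#else
#define PRIVATE_COPIES(k)
#endif

#define M 4

/* Hypothetical helper: each outer iteration fills its own scratch array
   and then reduces it into a running total. */
int sum_of_scaled_rows(int n) {
  int total = 0;
  for (int i = 0; i < n; i++) {
    /* Assumed value: request 2 copies of the memory, so that up to
       2 outer-loop iterations can each work on their own copy. */
    int PRIVATE_COPIES(2) scratch[M];

    for (int j = 0; j < M; j++) {
      scratch[j] = i * j;
    }
    for (int j = 0; j < M; j++) {
      total += scratch[j];
    }
  }
  return total;
}
```

The attribute does not change the result of the loop. It only constrains how many copies of the memory the compiler builds, so a suitable value for k is something to confirm against the compiler's reports rather than assume.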

Consider the following example where the outer loop declares four local arrays:

for (int i = 0; i < N; i++) {
  int local_A[M];
  int local_B[M];
  int local_C[M];
  int local_D[M];

  // Step 1
  for (int j = 0; j < M; j++) {
    local_A[j] = initA();
  }

  // Step 2
  for (int j = 0; j < M; j++) {
    local_B[j] = initB(local_A[j]);
  }

  // Step 3
  for (int j = 0; j < M; j++) {
    local_C[j] = initC(local_B[j]);
  }

  // Step 4
  for (int j = 0; j < M; j++) {
    local_D[j] = initD(local_C[j]);
  }
}

In this example, the outer loop contains four steps, each corresponding to an inner loop. Step 1 initializes the first local array, local_A. Step 2 reads local_A but does not write to it, and this is the last use of local_A in the outer loop. Similarly, local_B is first used in Step 2, where it is initialized. Step 3 reads local_B but does not write to it, and this is the last use of local_B. Likewise, local_C is used only in Steps 3 and 4.

Without the private_copies attribute, the Intel® FPGA SDK for OpenCL™ Offline Compiler privatizes each array by making 16 copies, which is enough to support a concurrency of 16 on the outer loop. However, because the live ranges of these local arrays do not span the entire outer loop, not all 16 copies are required to maximize the throughput of the outer loop, so the copies consume more area than necessary. In this case, applying the private_copies attribute to control the number of copies of these local arrays can reduce the area used while maintaining the throughput of the outer loop.
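The four-step loop with the attribute applied might look like the sketch below. The value 2 is an assumed starting point for tuning, not a figure taken from this guide. The PRIVATE_COPIES macro, the run_steps wrapper, the checksum accumulation in Step 4, and the bodies of initA through initD are hypothetical additions that make the fragment self-contained.

```c
/* Hypothetical wrapper macro (not part of the SDK): expands to the
   private_copies attribute under an OpenCL compiler, and to nothing
   elsewhere so the sketch also builds as plain C. */
#ifdef __OPENCL_VERSION__
#define PRIVATE_COPIES(k) __attribute__((private_copies(k)))
#else
#define PRIVATE_COPIES(k)
#endif

#define M 4

/* Placeholder bodies: the original example leaves these undefined. */
int initA(void)  { return 1; }
int initB(int a) { return a + 1; }
int initC(int b) { return b * 2; }
int initD(int c) { return c - 1; }

/* Hypothetical wrapper around the outer loop from the example. */
int run_steps(int n) {
  int checksum = 0;
  for (int i = 0; i < n; i++) {
    /* Assumed value: 2 copies per array instead of the 16 the
       compiler would otherwise create. */
    int PRIVATE_COPIES(2) local_A[M];
    int PRIVATE_COPIES(2) local_B[M];
    int PRIVATE_COPIES(2) local_C[M];
    int PRIVATE_COPIES(2) local_D[M];

    /* Step 1: initialize local_A */
    for (int j = 0; j < M; j++) {
      local_A[j] = initA();
    }
    /* Step 2: last use of local_A, first use of local_B */
    for (int j = 0; j < M; j++) {
      local_B[j] = initB(local_A[j]);
    }
    /* Step 3: last use of local_B, first use of local_C */
    for (int j = 0; j < M; j++) {
      local_C[j] = initC(local_B[j]);
    }
    /* Step 4: last use of local_C; the checksum is added here only
       so that the result is observable. */
    for (int j = 0; j < M; j++) {
      local_D[j] = initD(local_C[j]);
      checksum += local_D[j];
    }
  }
  return checksum;
}
```

Because each array is live for only part of the outer loop body, fewer outer iterations touch it at the same time than touch the loop as a whole, which is why a smaller copy count can plausibly sustain the same throughput. The right value depends on the design and is best confirmed by comparing area and loop throughput in the compiler's reports.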