Transfer Loop-Carried Dependency to Local Memory

Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853

Date 12/16/2022

Loop-carried dependencies can adversely affect the initiation interval (II) of a loop (refer to Pipelining Across Multiple Work Items). If you cannot remove a loop-carried dependency, you can improve the II by moving the array that carries the dependency from global memory to local memory.

Consider the following example:


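constexpr int N = 128;
queue.submit([&](handler &cgh) {
  accessor A(A_buf, cgh, read_write);
  cgh.single_task<class unoptimized>([=]() {
    // Every iteration loads from and stores to A in global memory.
    for (int i = 0; i < N; i++) {
      // Assuming A_buf holds N elements, N - 1 - i stays within 0..N-1.
      A[N - 1 - i] = A[i];
    }
  });
});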

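Global memory accesses have long latencies. In this example, a later iteration may read an element of the array A[i] that an earlier iteration wrote, so each load must wait for the preceding global memory store to complete. This loop-carried dependency prevents the loop from launching a new iteration every cycle, and the optimization report reflects the resulting latency with an II of 185. Perform the following tasks to reduce the II value by transferring the loop-carried dependency from global memory to local memory: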

1. Copy the array with the loop-carried dependency to local memory. In this example, array A[i] becomes array B[i] in local memory.
2. Execute the loop with the loop-carried dependency on array B[i].
3. Copy the array back to global memory.

When you transfer array A[i] to local memory and it becomes array B[i], the loop-carried dependency is now on B[i]. Because local memory has a much lower latency than global memory, the II value improves.
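To put the II in perspective, a common first-order model estimates the cycle count of a pipelined loop as the pipeline depth plus the II multiplied by one less than the trip count. The sketch below is plain host-side C++, not kernel code. The function name `estimated_loop_cycles` and its `depth` parameter are hypothetical, and the model ignores stalls, so treat its output as a rough guide rather than a figure from the optimization report.

```cpp
#include <cassert>
#include <cstdint>

// First-order model (an assumption, not a value taken from the report):
// a pipelined loop launches one iteration every `ii` cycles, so the last of
// `trip_count` iterations starts ii * (trip_count - 1) cycles after the first
// and completes `depth` cycles after it starts.
inline std::uint64_t estimated_loop_cycles(std::uint64_t ii,
                                           std::uint64_t trip_count,
                                           std::uint64_t depth) {
  assert(trip_count > 0 && "the loop must execute at least once");
  return depth + ii * (trip_count - 1);
}
```

With the reported II of 185 and N = 128, the II term alone comes to 185 × 127 = 23,495 cycles. The source does not state the II after the transformation, so any comparison value (for example, an II of 1, which would give 127 cycles) is only illustrative.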

The following is the restructured, optimized kernel:


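constexpr int N = 128;
queue.submit([&](handler &cgh) {
  accessor A(A_buf, cgh, read_write);
  cgh.single_task<class optimized>([=]() {
    int B[N];  // B is the local-memory copy of A

    // 1. Copy the array from global memory to local memory.
    for (int i = 0; i < N; i++)
      B[i] = A[i];

    // 2. Run the loop with the loop-carried dependency on B.
    //    N - 1 - i keeps the index within the N elements of B.
    for (int i = 0; i < N; i++)
      B[N - 1 - i] = B[i];

    // 3. Copy the result back to global memory.
    for (int i = 0; i < N; i++)
      A[i] = B[i];
  });
});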
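
One way to gain confidence in the restructuring is to check on the host that the staged version produces the same values as the direct loop. The following is a minimal sketch in plain C++ (no SYCL), assuming an N-element array and using the in-bounds index `N - 1 - i`. The function names are hypothetical stand-ins for the two kernel bodies.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 128;

// Hypothetical stand-in for the unoptimized kernel body: the loop reads and
// writes the "global" array directly.
inline void reverse_copy_direct(std::array<int, N> &a) {
  for (std::size_t i = 0; i < N; i++)
    a[N - 1 - i] = a[i];
}

// Hypothetical stand-in for the optimized kernel body: copy in, run the
// dependent loop on the local array, then copy out.
inline void reverse_copy_staged(std::array<int, N> &a) {
  int b[N];
  for (std::size_t i = 0; i < N; i++)
    b[i] = a[i];
  for (std::size_t i = 0; i < N; i++)
    b[N - 1 - i] = b[i];
  for (std::size_t i = 0; i < N; i++)
    a[i] = b[i];
}

// Runs both versions on the same ramp input (0, 1, 2, ...) and reports
// whether the resulting arrays are identical.
inline bool staged_matches_direct() {
  std::array<int, N> direct{};
  std::array<int, N> staged{};
  for (std::size_t i = 0; i < N; i++)
    direct[i] = staged[i] = static_cast<int>(i);
  reverse_copy_direct(direct);
  reverse_copy_staged(staged);
  assert(direct.size() == staged.size());
  return direct == staged;
}
```

Calling `staged_matches_direct()` returns true for this input. A functional check like this says nothing about the II, which still has to be read from the optimization report.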
