Developer Guide

Intel oneAPI DPC++/C++ Compiler Handbook for Intel FPGAs

ID 785441
Date 5/08/2024
Public
Document Table of Contents

Refactor the Loop-Carried Data Dependency

Based on the feedback from the optimization report, you can restructure your program to reduce the critical path of a pipelined loop by reducing the distance between the first time you use a loop-updated variable in the loop body and the last time you define it within a single iteration.

Consider the following code:

constexpr int N = 128;
queue.submit([&](handler &cgh) {
  accessor A(A_buf, cgh, read_only);
  accessor B(B_buf, cgh, read_only);
  accessor Result(Result_buf, cgh, write_only); 
  cgh.single_task<class unoptimized>([=]() {
    int sum = 0;
    for (int i = 0; i < N; i++) {
      for (int j = 0; j < N; j++) {
        sum += A[i * N + j];
      }
      sum += B[i];
    }
    Result[0] = sum;
  });
});
  • The report indicates that the Intel® oneAPI DPC++/C++ Compiler successfully infers pipelined execution for the outer loop, and a new loop iteration launches every other cycle.
  • The Only a single loop iteration will execute... message in the Details pane indicates that the loop executes a single iteration at a time across the subloop because of the data dependency on the variable sum. This data dependency exists because each outer loop iteration requires the value of sum from the previous iteration to return before the inner loop can start executing. Moreover, the serialization is enforced by a critical path spanning from the first use of sum in the body of the loop i to the last definition of sum at the end of the body of loop j.
  • The report also notifies you that the inner loop executes in a pipelined manner with no performance-limiting loop-carried dependencies.
NOTE:

For recommendations on how to structure your single work-item kernel, refer to Single Work-item Kernel Design Guidelines.

To optimize performance of this kernel, reduce the length of the critical path induced by the data dependency on variable sum so that the outer loop iterations do not execute serially across the subloop. Perform the following tasks to decouple the computations involving sum in the two loops:

  1. Define a local variable (for example, sum2) for use in the inner loop only.
  2. Use the local variable (sum2) to store the cumulative values of A[i * N + j] as the inner loop iterates.
  3. In the outer loop, store the cumulative values of B[i] and sum2 in the sum variable.

The following code illustrates the restructured optimized kernel snippet:

constexpr int N = 128;
 queue.submit([&](handler &cgh) {
   accessor A(A_buf, cgh, read_only);
   accessor B(B_buf, cgh, read_only);
   accessor Result(Result_buf, cgh, write_only);
   cgh.single_task<class optimized>([=]() {
     int sum = 0;
     for (int i = 0; i < N; i++) {
       // Step 1: Definition
       int sum2 = 0;
       // Step 2: Accumulation of array A values for one outer 
       // loop iteration
       for (int j = 0; j < N; j++) {
          sum2 += A[i * N + j];
       }
       // Step 3: Addition of array B value for an outer loop iteration
       sum += sum2;
       sum += B[i];
     }
     Result[0] = sum;
   });
 });