Developer Guide

Intel oneAPI DPC++/C++ Compiler Handbook for Intel FPGAs

ID 785441
Date 5/08/2024
Public
Document Table of Contents

Relax Loop-Carried Dependency

Example 1: Multiply Chain

Based on the feedback from the optimization report, you can relax a loop-carried dependency by allowing the compiler to infer a shift register and increase the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occur between the generation of a loop-carried value and its use.

Consider the following code example:

constexpr int N = 128;
 queue.submit([&](handler &cgh) {
   accessor A(A_buf, cgh, read_only);
   accessor result(result_buf, cgh, write_only);
   cgh.single_task<class unoptimized>([=]() {
     float mul = 1.0f;
     for (int i = 0; i < N; i++)
       mul *= A[i];
 
     result[0] = mul;
   });
 });

The optimization report shows that the Intel® oneAPI DPC++/C++ Compiler infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable mul causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 8 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul. The latency through the floating point multiplier means you have to wait for the multiply to complete (6 cycles) before feeding the output back into the input with A[i].

To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on M copies of the variable, and use one copy every M iterations:

  1. Declare multiple copies of the variable mul (for example, in an array called mul_copies).
  2. Initialize all copies of mul_copies.
  3. Use the last copy in the array in the multiplication operation.
  4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
  5. Sum up all copies of mul_copies to mul and write the final value to the result.

The following code illustrates the restructured kernel:

constexpr int N = 128;
 constexpr int M = 8;
 queue.submit([&](handler &cgh) {
   accessor A(A_buf, cgh, read_only);
   accessor result(result_buf, cgh, write_only);
   cgh.single_task<class optimized>([=]() {
     float mul = 1.0f;
 
     // Step 1: Declare multiple copies of variable mul
     float mul_copies[M];
 
     // Step 2: Initialize all copies
     for (int i = 0; i < M; i++)
       mul_copies[i] = 1.0f;
 
     for (int i = 0; i < N; i++) {
       // Step 3: Perform multiplication on the last copy
       float cur = mul_copies[M-1] * A[i];
 
       // Step 4a: Shift copies
       #pragma unroll 
       for (int j = M-1; j > 0; j--)
         mul_copies[j] = mul_copies[j-1];
 
       // Step 4b: Insert updated copy at the beginning
       mul_copies[0] = cur;
     }
 
     // Step 5: Perform reduction on copies
     #pragma unroll 
     for (int i = 0; i < M; i++)
       mul *= mul_copies[i];
 
     result[0] = mul;
   });
 });

Example 2: Accumulator

Similar to Example 1, this example also applies the same technique and demonstrates how to write single work-item kernels that carry out double precision floating-point operations efficiently by inferring a shift register.

Consider the following example:

queue.submit([&](handler &cgh) { 
  accessor arr(arr_buf, cgh, read_only); 
  accessor result(result_buf, cgh, write_only); 
  accessor N(N_buf, cgh, read_only); 
  cgh.single_task<class unoptimized>([=]() { 
    double temp_sum = 0; 
    for (int i = 0; i < *N; ++i) 
      temp_sum += arr[i]; 
    result[0] = temp_sum; 
  }); 
});

The kernel unoptimized is an accumulator that sums the elements of a double precision floating-point array arr[i]. For each loop iteration, the Intel® oneAPI DPC++/C++ Compiler takes 10 cycles to compute the result of the addition and then stores it in the variable temp_sum. Each loop iteration requires the value of temp_sum from the previous loop iteration, which creates a data dependency on temp_sum. To relax the data dependency, infer the array arr[i] as a shift register.

The following is the restructured kernel optimized:

//Shift register size must be statically determinable 
constexpr int II_CYCLES = 12; 
queue.submit([&](handler &cgh) { 
  accessor arr(arr_buf, cgh, read_only); 
  accessor result(result_buf, cgh, write_only); 
  accessor N(N_buf, cgh, read_only); 
  cgh.single_task<class optimized>([=]() { 
    //Create shift register with II_CYCLE+1 elements 
    double shift_reg[II_CYCLES+1]; 
 
    //Initialize all elements of the register to 0 
    for (int i = 0; i < II_CYCLES + 1; i++) { 
      shift_reg[i] = 0; 
    } 
     
    //Iterate through every element of input array 
    for(int i = 0; i < N; ++i){ 
      //Load ith element into end of shift register 
      //if N > II_CYCLE, add to shift_reg[0] to preserve values 
      shift_reg[II_CYCLES] = shift_reg[0] + arr[i]; 
   
      #pragma unroll 
      //Shift every element of shift register 
      for(int j = 0; j < II_CYCLES; ++j) 
      { 
        shift_reg[j] = shift_reg[j + 1]; 
      }  
    } 
  
    //Sum every element of shift register 
    double temp_sum = 0; 
   
    #pragma unroll  
    for(int i = 0; i < II_CYCLES; ++i) 
    { 
      temp_sum += shift_reg[i]; 
    } 
  
    result[0] = temp_sum; 
  }); 
});