ID 683176
Date 9/24/2018
Public

## 5.1.2. Relaxing Loop-Carried Dependency

Based on the feedback from the optimization report, you can relax a loop-carried dependency by increasing the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occurs between the generation of a loop-carried value and its usage.

Consider the following code example:

 1 #define N 128
2
3 __kernel void unoptimized (__global float * restrict A,
4                            __global float * restrict result)
5 {
6   float mul = 1.0f;
7
8   for (unsigned i = 0; i < N; i++)
9     mul *= A[i];
10
11   * result = mul;
12 }

==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file unoptimized.cl line 8)
Pipelined with successive iterations launched every 6 cycles due to:

Data dependency on variable mul  (file unoptimized.cl line 9)
Largest Critical Path Contributor:
100%: Fmul Operation  (file unoptimized.cl line 9)

===================================================================================

The optimization report above shows that the infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable mul causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 9 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul.

To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on M copies of the variable and use one copy every M iterations:

1. Declare multiple copies of the variable mul (for example, in an array called mul_copies).
2. Initialize all the copies of mul_copies.
3. Use the last copy in the array in the multiplication operation.
4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
5. Reduce all the copies to mul and write the final value to result.
Below is the restructured kernel:
 1 #define N 128
2 #define M 8
3
4 __kernel void optimized (__global float * restrict A,
5                          __global float * restrict result)
6 {
7   float mul = 1.0f;
8
9   // Step 1: Declare multiple copies of variable mul
10   float mul_copies[M];
11
12   // Step 2: Initialize all copies
13   for (unsigned i = 0; i < M; i++)
14     mul_copies[i] = 1.0f;
15
16   for (unsigned i = 0; i < N; i++) {
17     // Step 3: Perform multiplication on the last copy
18     float cur = mul_copies[M-1] * A[i];
19
20     // Step 4a: Shift copies
21     #pragma unroll
22     for (unsigned j = M-1; j > 0; j--)
23       mul_copies[j] = mul_copies[j-1];
24
25     // Step 4b: Insert updated copy at the beginning
26     mul_copies[0] = cur;
27   }
28
29   // Step 5: Perform reduction on copies
30   #pragma unroll
31   for (unsigned i = 0; i < M; i++)
32     mul *= mul_copies[i];
33
34   * result = mul;
35 }

An optimization report similar to the one below indicates the successful relaxation of the loop-carried dependency on the variable mul:

==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Fully unrolled loop (file optimized2.cl line 13)
Loop was automatically and fully unrolled.
Add "#pragma unroll 1" to prevent automatic unrolling.

+ Loop "Block1" (file optimized2.cl line 16)
| Pipelined well. Successive iterations are launched every cycle.
|
|
|-+ Fully unrolled loop (file optimized2.cl line 22)
Loop was fully unrolled due to "#pragma unroll" annotation.

+ Fully unrolled loop (file optimized2.cl line 31)
Loop was fully unrolled due to "#pragma unroll" annotation.

Characters remaining:

Feedback Message