Visible to Intel only — GUID: xeb1517862207805
Ixiasoft
Visible to Intel only — GUID: xeb1517862207805
Ixiasoft
5.1.4. Transferring Loop-Carried Dependency to Local Memory
Consider the following kernel example:
1 #define N 128
2
3 __kernel void unoptimized( __global int* restrict A )
4 {
5 for (unsigned i = 0; i < N; i++)
6 A[N-i] = A[i];
7 }
================================================================================== Kernel: unoptimized ================================================================================== The kernel is compiled for single work-item execution. Loop Report: + Loop "Block1" (file unoptimized4.cl line 5) Pipelined with successive iterations launched every 324 cycles due to: Memory dependency on Load Operation from: (file unoptimized4.cl line 6) Store Operation (file unoptimized4.cl line 6) Largest Critical Path Contributors: 49%: Load Operation (file unoptimized4.cl line 6) 49%: Store Operation (file unoptimized4.cl line 6)
Global memory accesses have long latencies. In this example, the loop-carried dependency on the array A[i] causes the long latency. This latency is reflected by an II of 324 in the optimization report. To reduce the II value by transferring the loop-carried dependency from global memory to local memory, perform the following tasks:
- Copy the array with the loop-carried dependency to local memory. In this example, array A[i] becomes array B[i] in local memory.
- Execute the loop with the loop-carried dependence on array B[i].
- Copy the array back to global memory.
Below is the restructured kernel optimized:
1 #define N 128
2
3 __kernel void optimized( __global int* restrict A )
4 {
5 int B[N];
6
7 for (unsigned i = 0; i < N; i++)
8 B[i] = A[i];
9
10 for (unsigned i = 0; i < N; i++)
11 B[N-i] = B[i];
12
13 for (unsigned i = 0; i < N; i++)
14 A[i] = B[i];
15 }
An optimization report similar to the one below indicates the successful reduction of II from 324 to 2:
================================================================================== Kernel: optimized ================================================================================== The kernel is compiled for single work-item execution. Loop Report: + Loop "Block1" (file optimized4.cl line 7) Pipelined well. Successive iterations are launched every cycle. + Loop "Block2" (file optimized4.cl line 10) Pipelined with successive iterations launched every 2 cycles due to: Memory dependency on Load Operation from: (file optimized4.cl line 11) Store Operation (file optimized4.cl line 11) Largest Critical Path Contributors: 65%: Load Operation (file optimized4.cl line 11) 34%: Store Operation (file optimized4.cl line 11) + Loop "Block3" (file optimized4.cl line 13) Pipelined well. Successive iterations are launched every cycle.