Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018
Public
Document Table of Contents

5.1.3. Simplifying Loop-Carried Dependency

In cases where you cannot remove or relax the loop-carried dependency in your kernel, you might be able to simplify the dependency to improve single work-item kernel performance.

Consider the following kernel example:

 1 #define N 128
 2 #define NUM_CH 3
 3 
 4 channel uchar CH_DATA_IN[NUM_CH];
 5 channel uchar CH_DATA_OUT;
 6 
 7 __kernel void unoptimized()
 8 {
 9   unsigned storage = 0;
10   unsigned num_bytes = 0;
11 
12   for (unsigned i = 0; i < N; i++) {
13 
14     #pragma unroll
15     for (unsigned j = 0; j < NUM_CH; j++) {
16       if (num_bytes < NUM_CH) {
17         bool valid = false;
18         uchar data_in = read_channel_nb_intel(CH_DATA_IN[j], &valid);
19         if (valid) {
20           storage <<= 8;
21           storage |= data_in;
22           num_bytes++;
23         }
24       }
25     }
26 
27     if (num_bytes >= 1) {
28       num_bytes -= 1;
29       uchar data_out = storage >> (num_bytes*8);
30       write_channel_intel(CH_DATA_OUT, data_out);
31     }
32   }
33 } 

This kernel reads one byte of data from three input channels in a nonblocking fashion. It then writes the data one byte at a time to an output channel. It uses the variable storage to store up to 4 bytes of data, and uses the variable num_bytes to keep track of how many bytes are stored in storage. If storage has space available, then the kernel reads a byte of data from one of the channels and stores it in the least significant byte of storage.

The optimization report below indicates that there is a loop-carried dependency on the variable num_bytes:

==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

 + Loop "Block1" (file unoptimized3.cl line 12)
 | Pipelined with successive iterations launched every 7 cycles due to:
 |
 |     Data dependency on variable num_bytes  (file unoptimized3.cl line 10)
 |     Largest Critical Path Contributors:
 |      16%: Integer Compare Operation  (file unoptimized3.cl line 16)
 |      16%: Integer Compare Operation  (file unoptimized3.cl line 16)
 |      16%: Integer Compare Operation  (file unoptimized3.cl line 16)
 |       7%: Integer Compare Operation  (file unoptimized3.cl line 27)
 |       6%: Add Operation  (file unoptimized3.cl line 10, line 22, line 28)
 |       6%: Add Operation  (file unoptimized3.cl line 10, line 22, line 28)
 |       6%: Add Operation  (file unoptimized3.cl line 10, line 22, line 28)
 |       3%: Non-Blocking Channel Read Operation  (file unoptimized3.cl line 18)
 |       3%: Non-Blocking Channel Read Operation  (file unoptimized3.cl line 18)
 |       3%: Non-Blocking Channel Read Operation  (file unoptimized3.cl line 18)
 |
 |
 |-+ Fully unrolled loop (file unoptimized3.cl line 15)
     Loop was fully unrolled due to "#pragma unroll" annotation.  

The computation path of num_bytes is as follows:

  1. Comparison on line 16 (if (num_bytes < NUM_CH)).
  2. Computation of variable valid by the nonblocking channel read operation on line 18 (uchar data_in = read_channel_nb_intel(CH_DATA_IN[j], &valid)) for the comparison on line 19.
  3. Addition on line 22 (num_bytes++).
  4. Comparison on line 27 (if (num_bytes >= 1)).
  5. Subtraction on line 28 (num_bytes -= 1).

Because of the unroll pragma on line 14, the unrolls the loop, causing the comparisons and additions in the loop body to replicate three times. The optimization report shows that the comparisons are the most expensive operations on the computation path of num_bytes, followed by the additions on line 22.

To simplify the loop-carried dependency on num_bytes, consider restructuring the application to perform the following tasks:

  1. Ensure that the kernel reads from the channels only if there is enough space available in storage, in the event that all channel read operations return data (that is, there is at least 3 bytes of empty space in storage).
    Setting this condition simplifies the computation path of the variable num_bytes by reducing the number of comparisons.
  2. Increase the size of storage from 4 bytes to 8 bytes to satisfy the 3-byte space threshold more easily.
Below is the restructured kernel optimized:
 1 #define N 128
 2 #define NUM_CH 3
 3 
 4 channel uchar CH_DATA_IN[NUM_CH];
 5 channel uchar CH_DATA_OUT;
 6 
 7 __kernel void optimized()
 8 {
 9   // Change storage to 64 bits
10   ulong storage = 0;
11   unsigned num_bytes = 0;
12 
13   for (unsigned i = 0; i < N; i++) {
14 
15     // Ensure that we have enough space if we read from ALL channels
16     if (num_bytes <= (8-NUM_CH)) {
17       #pragma unroll
18       for (unsigned j = 0; j < NUM_CH; j++) {
19         bool valid = false;
20         uchar data_in = read_channel_nb_intel(CH_DATA_IN[j], &valid);
21         if (valid) {
22           storage <<= 8;
23           storage |= data_in;
24           num_bytes++;
25         }
26       }
27     }
28 
29     if (num_bytes >= 1) {
30       num_bytes -= 1;
31       uchar data_out = storage >> (num_bytes*8);
32       write_channel_intel(CH_DATA_OUT, data_out);
33     }
34   }
35 }

An optimization report similar to the one below indicates the successful simplification of the loop-carried dependency on the variable num_bytes:

==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

 + Loop "Block1" (file optimized3.cl line 13)
 | Pipelined well. Successive iterations are launched every cycle.
 |
 |
 |-+ Fully unrolled loop (file optimized3.cl line 18)
     Loop was fully unrolled due to "#pragma unroll" annotation.