Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022
Public

3.5. Channels

The Intel® FPGA SDK for OpenCL™'s channel implementation provides a flexible way to pass data from one kernel to another kernel to improve performance.

When declaring a channel in your kernel code, precede the declaration with the keyword channel.

For example:

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel long16 myCh __attribute__((depth(16)));
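
To make the meaning of the depth attribute concrete, the following is a minimal software sketch in plain C of a channel as a bounded FIFO. It is an illustration only, not SDK code: the names model_channel, model_write_nb, model_read_nb, and MODEL_DEPTH are hypothetical, the element type is simplified to int, and the capacity is assumed to equal the declared depth exactly (the implemented hardware depth can be larger).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical software model of a channel: a bounded FIFO of ints.
   MODEL_DEPTH mirrors depth(16) in the declaration; real hardware may be deeper. */
#define MODEL_DEPTH 16

typedef struct {
    int    data[MODEL_DEPTH];
    size_t head;   /* index of the oldest queued element */
    size_t count;  /* number of queued elements */
} model_channel;

/* Non-blocking write: stores the value and returns true,
   or returns false without storing anything when the FIFO is full. */
static bool model_write_nb(model_channel *ch, int value) {
    if (ch->count == MODEL_DEPTH) return false;
    ch->data[(ch->head + ch->count) % MODEL_DEPTH] = value;
    ch->count++;
    return true;
}

/* Non-blocking read: pops the oldest value into *out and returns true,
   or returns false and leaves *out untouched when the FIFO is empty. */
static bool model_read_nb(model_channel *ch, int *out) {
    if (ch->count == 0) return false;
    *out = ch->data[ch->head];
    ch->head = (ch->head + 1) % MODEL_DEPTH;
    ch->count--;
    return true;
}
```

In this model, a producer can complete 16 writes before any read happens; the 17th non-blocking write reports failure until a consumer removes an element.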

In the HTML report, the area report maps the channel area to the declaration line in the source code. Channels and channel arrays are reported with their width and depth.

Note: The implemented channel depth can differ from the depth that you specify in the channel declaration. The Intel FPGA SDK for OpenCL Offline Compiler can implement the channel in shift registers or RAM blocks. The offline compiler decides on the type of channel implementation based on the channel depth.

The depth attribute is treated as the minimum depth specification. The offline compiler may increase the depth for the following reasons:

  • Instruction scheduling requirements. This may happen due to the following reasons:
    • To balance reconverging paths through multiple kernels.

      When multiple paths exist from one kernel to another via channels and other kernels, the depths on these channels may be increased to balance latencies among all these paths. This is a throughput optimization, as unbalanced paths are likely to lead to pipeline stalls.

    • To achieve a lower II for a loop containing a non-blocking write to a channel.

      The offline compiler may increase the depth of the channel in order to achieve a lower loop II.

      Consider the following loop that reads next_val from the global memory only if the non-blocking write to my_channel succeeded in the previous iteration.

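      bool write_valid = true;   // result of the previous non-blocking write
      int next_val = 0;
      while (not_done) {
         // Fetch a new value only if the previous one was accepted by the channel.
         if (write_valid) {
            next_val = *global_mem_ptr;
            global_mem_ptr++;
         }
         // Returns true if next_val was written, false if my_channel was full.
         write_valid = write_channel_nb_intel(my_channel, next_val);
         not_done = some_fn(next_val);
      }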

      With a naive implementation, this loop has a very high II because the high-latency global memory read must complete before the write into the channel can begin and the next value of write_valid can be computed. To remove the global read from the II-critical path, the compiler can instead check whether my_channel has space to accept the value before performing the global read. The channel fullness check takes one clock cycle, so the next loop iteration can start as soon as the check completes, giving II=1 for this loop. For the resulting hardware to be functionally correct, the channel must be deepened by the latency of the global read or, more precisely, by the schedule distance between the channel fullness check and the actual write, which may be slightly greater than the global read latency. If you do not want the channel deepened in this situation, identify and remove the loop-carried dependency on the valid return value of the write_channel_nb_intel() call.

  • The nature of the underlying FIFO implementation. This happens if the chosen underlying implementation cannot support the exact depth required, so the depth must be increased to the next supported size.
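
The deepening described for the non-blocking write case can be illustrated with a small cycle-level simulation in plain C. This is a simplified sketch under stated assumptions, not a description of how the offline compiler is implemented: the function peak_occupancy and its parameters are hypothetical, one loop iteration is assumed to start every cycle (II=1), the early fullness check is assumed to compare occupancy against the declared depth, and each accepted write is assumed to reach the FIFO exactly latency cycles after its check.

```c
#include <stdlib.h>

/* Hypothetical model: returns the peak FIFO occupancy seen over `cycles`
   simulated clock cycles, or -1 if memory allocation fails.
   declared_depth: threshold used by the early fullness check.
   latency:        cycles between the fullness check and the actual write.
   drain_period:   consumer removes one element every drain_period cycles;
                   0 means the consumer never reads. */
int peak_occupancy(int declared_depth, int latency, int cycles, int drain_period) {
    int slots = latency + 1;                 /* ring buffer of in-flight writes */
    int *landing = calloc((size_t)slots, sizeof *landing);
    if (!landing) return -1;

    int occupancy = 0, peak = 0;
    for (int t = 0; t < cycles; t++) {
        /* Early fullness check: start a new iteration if the FIFO looks non-full.
           Writes already in flight are not visible to this check. */
        if (occupancy < declared_depth)
            landing[(t + latency) % slots] = 1;

        /* The write whose check happened `latency` cycles ago reaches the FIFO. */
        if (landing[t % slots]) {
            occupancy++;
            landing[t % slots] = 0;
        }
        if (occupancy > peak) peak = occupancy;

        /* Optional consumer. */
        if (drain_period > 0 && t % drain_period == 0 && occupancy > 0)
            occupancy--;
    }
    free(landing);
    return peak;
}
```

In this model, with a stalled consumer, peak_occupancy(4, 3, 100, 0) returns 7: the declared depth of 4 plus the 3 writes that were still in flight when the check first failed. A FIFO of exactly the declared depth would have overflowed, which is why the channel has to be deepened by the schedule distance between the check and the write.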