Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022
Public

3.1. Kernels

Each kernel in your OpenCL system is represented by a set of blocks. Inside each block is a set of non-branching instructions that implement your algorithm and the offline compiler's loop orchestration logic. Together, the blocks show the execution flow of your kernel. When the kernel contains loops, a back edge leads from a block to itself or to an earlier block, depending on the loop structure (for example, nested loops). Loops usually impose initiation interval (II) bottlenecks and are a main focus for optimization.

A block has three main parts: an input or loop input node, a set of instructions, and a branch node. The input and branch nodes may not be present, depending on whether there is branching into or out of the block. The input or loop input node determines the initial value of variables depending on where the branch into this block originated. The rest of the block contains stallable instructions, non-stallable instructions, and clusters. A well-optimized design should contain a minimal number of stallable instructions, such as stallable I/O or memory accesses.

The non-stallable instructions within a block are grouped into clusters to reduce handshaking overhead with stallable instructions. A cluster has an entry node and an exit node. The only type of cluster is a stall-free cluster. You can find the exit FIFO information in the cluster’s exit node. Finally, the branch node specifies which block executes next and under which condition.

In the HLD report, you can find different views of your kernel, under the Views drop-down menu. For more information, refer to Using Views.

The Intel® FPGA SDK for OpenCL™ Offline Compiler compiles a kernel that does not use any built-in work-item functions, such as get_global_id() and get_local_id(), as a single work-item kernel. Otherwise, the offline compiler compiles the kernel as an NDRange kernel. For more information about built-in work-item functions, refer to section 6.11.1: Work-Item Functions of the OpenCL Specification version 1.0.
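As a rough illustration of the distinction described above, the sketch below writes the same scaling operation in both styles. The kernel names (scale_ndrange, scale_single), the emu_ helpers, and the empty kernel and global macros are assumptions introduced here so that the code builds with an ordinary C compiler; they are not part of the SDK. In real OpenCL C, the qualifiers and get_global_id() are supplied by the language.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side shims, assumptions for illustration only: they let this sketch
   build with an ordinary C compiler. In OpenCL C, the kernel and global
   qualifiers and get_global_id() are provided by the language. */
#define kernel
#define global
static size_t emu_global_id = 0;   /* hypothetical stand-in for the runtime */
static size_t get_global_id(unsigned int dim)
{
    assert(dim == 0);              /* this sketch models a 1-D index space only */
    return emu_global_id;
}

/* NDRange style: the body calls get_global_id(), so per the text above the
   offline compiler would compile it as an NDRange kernel. Each work-item
   handles exactly one element. */
kernel void scale_ndrange(global const float *in, global float *out, float k)
{
    size_t i = get_global_id(0);
    out[i] = k * in[i];
}

/* Single work-item style: no built-in work-item functions are used. One
   invocation walks the whole array in a loop, which the offline compiler
   can then attempt to pipeline. */
kernel void scale_single(global const float *in, global float *out,
                         float k, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = k * in[i];
    }
}

/* Hypothetical host-side emulation of an NDRange launch: run the kernel
   body once per work-item, changing only the global ID. */
static void emu_launch_ndrange(const float *in, float *out, float k,
                               size_t global_size)
{
    assert(in != NULL && out != NULL);
    for (emu_global_id = 0; emu_global_id < global_size; emu_global_id++) {
        scale_ndrange(in, out, k);
    }
}
```

Both forms compute the same result. What differs is where the iteration lives: in the index space for the NDRange form, and in an explicit loop for the single work-item form.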

For single work-item kernels, the offline compiler attempts to pipeline every loop in the kernel to allow multiple loop iterations to execute concurrently. Kernel performance might degrade if the compiler cannot pipeline some of the loops effectively, or if it cannot pipeline the loops at all.
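To give a feel for why ineffective pipelining hurts, the following sketch uses a simplified textbook cycle model rather than anything the offline compiler reports. It assumes a fixed per-iteration latency and a fixed II, and the function names pipelined_cycles and serial_cycles are hypothetical.

```c
#include <assert.h>

/* Idealized cycle count for a pipelined loop (simplifying assumption):
   the first iteration completes after `latency` cycles, and each later
   iteration is launched `ii` cycles after the one before it. */
static unsigned long pipelined_cycles(unsigned long n, unsigned long latency,
                                      unsigned long ii)
{
    assert(ii >= 1);               /* an II below 1 is not meaningful here */
    if (n == 0) {
        return 0;
    }
    return latency + ii * (n - 1);
}

/* Idealized cycle count when the loop is not pipelined at all: each
   iteration must finish before the next one starts. */
static unsigned long serial_cycles(unsigned long n, unsigned long latency)
{
    return n * latency;
}
```

Under these assumptions, a 1000-iteration loop with a latency of 10 cycles takes 1009 cycles at II = 1, 4006 cycles at II = 4, and 10000 cycles if it is not pipelined.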

The offline compiler cannot pipeline loops in NDRange kernels. However, these loops can accept multiple work-items simultaneously. A kernel might have multiple loops, each with nested loops. If you tabulate the total number of iterations of nested loops for each outer loop, kernel throughput is usually reduced by a factor of the largest total iterations value that you have tabulated. To execute an NDRange kernel efficiently, there must be a large number of work-items (threads).
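The tabulation described above can be sketched as a small calculation. The sketch assumes perfectly nested loops with fixed trip counts, so the total for each outer loop is the product of the trip counts at every level. The loop_nest type, the MAX_DEPTH limit, and the function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 4                /* arbitrary limit chosen for this sketch */

/* One outer loop together with the loops nested inside it.
   trip_counts[0] is the outer loop; later entries are the inner levels. */
typedef struct {
    size_t depth;
    unsigned long trip_counts[MAX_DEPTH];
} loop_nest;

/* Total iterations of one nest: the product of the trip counts at every
   level (assumes perfect nesting and fixed trip counts). */
static unsigned long nest_total_iterations(const loop_nest *nest)
{
    assert(nest != NULL && nest->depth <= MAX_DEPTH);
    unsigned long total = 1;
    for (size_t d = 0; d < nest->depth; d++) {
        total *= nest->trip_counts[d];
    }
    return total;
}

/* Largest total across all outer loops in the kernel: the value that the
   text above identifies as the usual limit on throughput. */
static unsigned long largest_total_iterations(const loop_nest *nests,
                                              size_t count)
{
    unsigned long largest = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned long total = nest_total_iterations(&nests[i]);
        if (total > largest) {
            largest = total;
        }
    }
    return largest;
}
```

For example, a kernel with one outer loop of 8 iterations containing a 16-iteration inner loop (128 total) and a second, three-level nest of 4 x 4 x 4 (64 total) has a largest tabulated value of 128.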