Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018

5.3. Good Design Practices for Single Work-Item Kernel

If your OpenCL™ kernels contain loop structures, follow the Intel®-recommended guidelines to construct the kernels in a way that allows the Intel® FPGA SDK for OpenCL™ Offline Compiler to analyze them effectively. Well-structured loops are particularly important when you direct the offline compiler to execute loops with pipeline parallelism.

Avoid Pointer Aliasing

Insert the restrict keyword in pointer arguments whenever possible. Including the restrict keyword in pointer arguments prevents the offline compiler from creating unnecessary memory dependencies between non-conflicting read and write operations. Consider a loop where each iteration reads data from one array and then writes data to another array in the same physical memory. Without the restrict keyword on these pointer arguments, the offline compiler might assume a dependence between the two arrays and extract less pipeline parallelism as a result.
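As a minimal sketch of the case described above, the function below reads from one array and writes to another, with both pointers marked restrict. It is written as plain C99 so that it can be compiled on a host; in an OpenCL kernel the function would be declared __kernel and the pointers __global. The name scale_copy and the doubling operation are illustrative assumptions, not taken from this guide.

```c
#include <stddef.h>

/* Hypothetical example: each iteration reads src[i] and writes dst[i].
 * The restrict qualifiers promise that src and dst never overlap, so a
 * compiler need not assume that a write to dst affects a later read of src. */
void scale_copy(const int *restrict src, int *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = 2 * src[i];
    }
}
```

The promise is the caller's responsibility: passing overlapping buffers through restrict-qualified pointers results in undefined behavior, so add the keyword only when the arrays are known to be distinct.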

Construct "Well-Formed" Loops

A "well-formed" loop has an exit condition that compares against an integer bound, and has a simple induction increment of one per iteration. Including "well-formed" loops in your kernel improves performance because the offline compiler can analyze these loops efficiently.

The following example is a "well-formed" loop:

for (i = 0; i < N; i++) {
    // statements
}

Important: "Well-formed" nested loops also contribute to maximizing kernel performance.

The following example is a "well-formed" nested loop structure:

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        // statements
    }
}

Minimize Loop-Carried Dependencies

The loop structure below creates a loop-carried dependence because each loop iteration reads data written by the previous iteration. As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies decreases the extent of pipeline parallelism that the offline compiler can achieve, which reduces kernel performance.

for (int i = 1; i < N; i++) {
    A[i] = A[i - 1] + i;
}

The offline compiler performs a static memory dependence analysis on loops to determine the extent of parallelism that it can achieve. In some cases, the offline compiler might assume a dependence between two array accesses and extract less pipeline parallelism as a result. The offline compiler assumes a loop-carried dependence if it cannot resolve the dependencies at compilation time because of unknown variables, or if the array accesses involve complex addressing.

To minimize loop-carried dependencies, follow the guidelines below whenever possible:

  • Avoid pointer arithmetic.

    Compiler output is suboptimal when the kernel accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array in the following manner:

    for (int i = 0; i < N; i++) {
        int t = *(A++);
        *A = t;
    }

  • Introduce simple array indexes.

    Avoid the following types of complex array indexes because the offline compiler cannot analyze them effectively, which might lead to suboptimal compiler output:

    • Nonconstants in array indexes.

      For example, A[K + i], where i is the loop index variable and K is an unknown variable.

    • Multiple index variables in the same subscript location.

      For example, A[i + 2 × j], where i and j are loop index variables for a double nested loop.

      Note: The offline compiler can analyze the array index A[i][j] effectively because the index variables are in different subscripts.

    • Nonlinear indexing.

      For example, A[i & C], where i is a loop index variable and C is a constant or a nonconstant variable.

  • Use loops with constant bounds in your kernel whenever possible.

    Loops with constant bounds allow the offline compiler to perform range analysis effectively.
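The guidelines above can be sketched as a rewrite of the pointer-arithmetic loop shown earlier. The version below uses a constant loop bound and one simple subscript per access. It is plain C99 for host compilation; the function name propagate_first and the value chosen for N are assumptions made for illustration.

```c
#define N 8  /* assumed compile-time constant bound */

/* Index-based equivalent of:
 *     int t = *(A++);
 *     *A = t;
 * Each iteration copies A[i] into A[i + 1], so A must hold N + 1 elements.
 * The loop-carried dependence is still present, but the access pattern is
 * now visible in the subscripts instead of hidden in a moving pointer. */
void propagate_first(int A[N + 1])
{
    for (int i = 0; i < N; i++) {
        A[i + 1] = A[i];
    }
}
```

This rewrite does not remove the dependence between iterations. It only states the addressing in a form that static analysis can read directly.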

Avoid Complex Loop Exit Conditions

The offline compiler evaluates exit conditions to determine if subsequent loop iterations can enter the loop pipeline. There are times when the offline compiler requires memory accesses or complex operations to evaluate the exit condition. In these cases, subsequent iterations cannot launch until the evaluation completes, decreasing overall loop performance.
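One possible way to keep the exit condition simple is to read any memory-resident bound into a local variable before the loop starts, so that the exit test is a plain integer comparison. The sketch below is plain C99 under that assumption; sum_prefix and count_ptr are hypothetical names, and this guide does not state whether the offline compiler would perform the same hoisting on its own.

```c
/* Hypothetical example: add up the first *count_ptr elements of data. */
int sum_prefix(const int *data, const int *count_ptr)
{
    /* Read the bound from memory once, before the loop. */
    const int count = *count_ptr;
    int sum = 0;

    /* The exit test compares the induction variable against a local
     * integer; it performs no memory access of its own. */
    for (int i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}
```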

Convert Nested Loops into a Single Loop

To maximize performance, combine nested loops into a single loop whenever possible. Restructuring nested loops into a single loop reduces hardware footprint and computational overhead between loop iterations.

The following code examples illustrate the conversion of a nested loop into a single loop:

Nested Loop:

for (i = 0; i < N; i++) {
    // statements
    for (j = 0; j < M; j++) {
        // statements
    }
    // statements
}

Converted Single Loop:

for (i = 0; i < N*M; i++) {
    // statements
}
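When the loop body still needs the original row and column indexes, one option is to carry them as counters inside the single loop. The sketch below is plain C99 and illustrates that idea only; the function name fill_flat, the values of N and M, and the stored expression are assumptions and do not come from this guide.

```c
#define N 3  /* assumed outer trip count */
#define M 4  /* assumed inner trip count */

/* Hypothetical example: one loop of N*M iterations replaces the nest.
 * i and j are advanced by hand so that they take the same values, in the
 * same order, that the nested loops would have produced. */
void fill_flat(int out[N * M])
{
    int i = 0;  /* row counter */
    int j = 0;  /* column counter */

    for (int k = 0; k < N * M; k++) {
        out[k] = 10 * i + j;

        /* Advance (i, j): wrap the column and move to the next row. */
        j++;
        if (j == M) {
            j = 0;
            i++;
        }
    }
}
```

The output array is addressed with the single index k, which keeps each subscript to one loop variable.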

Declare Variables in the Deepest Scope Possible

To reduce the hardware resources necessary for implementing a variable, declare the variable immediately before its use, inside the loop that uses it. Declaring variables in the deepest scope possible minimizes data dependencies and hardware usage because the offline compiler does not need to preserve the variable data across loops that do not use the variables.

Consider the following example:

int a[N];
for (int i = 0; i < m; ++i) {
    int b[N];
    for (int j = 0; j < n; ++j) {
        // statements
    }
}

The array a requires more resources to implement than the array b because a must be preserved across every iteration of the outer loop. To reduce hardware usage, declare array a inside the outer loop, just outside the inner loop as b is, unless its data must persist across iterations of the outer loop.

Tip: Overwriting all values of a variable in the deepest scope possible also reduces the resources necessary to implement the variable.
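The scoping advice and the tip can be sketched together. In the plain C99 function below, the scratch array is declared inside the outer loop and every element is overwritten before it is read, so no value needs to survive from one outer iteration to the next. The names row_sums and scratch, the value of N, and the arithmetic are illustrative assumptions.

```c
#define N 4  /* assumed row length */

/* Hypothetical example: for each row, build a scratch array and store
 * the sum of its entries in out[i]. */
void row_sums(int in[][N], int *out, int rows)
{
    for (int i = 0; i < rows; ++i) {
        /* Deepest scope that covers all uses: one outer iteration. */
        int scratch[N];

        /* Every element is overwritten before any element is read. */
        for (int j = 0; j < N; ++j) {
            scratch[j] = 2 * in[i][j];
        }

        int sum = 0;
        for (int j = 0; j < N; ++j) {
            sum += scratch[j];
        }
        out[i] = sum;
    }
}
```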