Single Work-item Kernel Design Guidelines

Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853

Date 12/16/2022

If your kernels contain loop structures, follow the Intel®-recommended guidelines below so that the Intel® oneAPI DPC++/C++ Compiler can analyze them effectively. Well-structured loops are particularly important because they help the compiler generate a pipeline-parallel datapath.

Avoid Pointer Aliasing

If your kernels have pointer arguments, you can improve the throughput of the design if the Intel® oneAPI DPC++/C++ Compiler can prove that those arguments never point to the same memory location. You can give the compiler this information about kernel pointer arguments explicitly. For more information, refer to Ignoring Dependencies Between Accessor Arguments.
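To see why aliasing matters, consider the following host-side sketch. It is plain C++ rather than kernel code, and the function name `add_first` is hypothetical. When the two pointers refer to the same buffer, the first store changes a value that later iterations read, so the result differs from the non-overlapping case. A compiler that cannot rule this out has to keep the loads and stores in order.

```cpp
#include <cstddef>

// Hypothetical helper: adds the first input element to every element.
// If `out` and `in` may overlap, the write to out[0] can change in[0],
// which every later iteration reads.
void add_first(const int* in, int* out, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) {
    out[i] = in[i] + in[0];
  }
}
```

With separate buffers, the input `{1, 2, 3}` produces `{2, 3, 4}`. Calling the same function with `in` and `out` pointing at one buffer produces `{2, 4, 5}`, because `in[0]` is overwritten in the first iteration.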

Construct "Well-Formed" Loops

A well-formed loop has an exit condition that compares against an integer bound and has a simple induction increment. Including well-formed loops in your kernel improves performance because the Intel® oneAPI DPC++/C++ Compiler can analyze these loops efficiently.

The following example is a well-formed loop:


for (i = 0; i < N; i++) {
  //statements
}

NOTE:

Well-formed nested loops also contribute to maximizing kernel performance.

The following example is a well-formed nested loop structure:


for (i = 0; i < N; i++) {
  //statements
  for(j = 0; j < M; j++) {
    //statements
  }
}

Minimize Loop-Carried Dependencies

The following loop structure creates a loop-carried dependence because each loop iteration reads data written by the previous iteration:


for (int i = 1; i < N; i++) {
  A[i] = A[i - 1] + i;
}

As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies decreases the extent of pipeline parallelism that the Intel® oneAPI DPC++/C++ Compiler can achieve, which reduces kernel performance.
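One common way to shorten such a dependence is to carry the running value in a local scalar instead of reading it back from the array. The sketch below is an illustration, not a guaranteed optimization: it uses a variant of the loop above that starts at `i = 1` so that `A[i - 1]` stays in bounds, and the function names are hypothetical. The dependence between iterations still exists, but it no longer passes through a memory read.

```cpp
// Baseline: each iteration reads the array element that the previous
// iteration wrote, so the dependence goes through memory.
void prefix_in_memory(int* A, int N) {
  for (int i = 1; i < N; i++) {
    A[i] = A[i - 1] + i;
  }
}

// Sketch of the alternative: the running value is kept in a local scalar,
// so the loop body only writes to the array and never reads from it.
void prefix_in_scalar(int* A, int N) {
  if (N <= 0) return;
  int prev = A[0];  // seed with the first element
  for (int i = 1; i < N; i++) {
    prev = prev + i;
    A[i] = prev;
  }
}
```

Both functions leave the array in the same state. Whether the second form pipelines better depends on the compiler and the target, so check the compiler reports for your design.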

The Intel® oneAPI DPC++/C++ Compiler performs a static memory dependence analysis on loops to determine the extent of parallelism that it can achieve. If the compiler cannot resolve the dependencies between two array accesses at compilation time, because of unknown variables or complex indexing expressions, it might assume a loop-carried dependence and, as a result, extract less pipeline parallelism.

To minimize loop-carried dependencies, follow these guidelines whenever possible:

- Avoid pointer arithmetic. Compiler output is suboptimal when the kernel accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array in the following manner:
  ```
  for (int i = 0; i < N; i++) {
    int t = *(A++);
    *A = t;
  }
  ```
- Use simple, affine array indexes. Avoid the following types of complex array indexes because the Intel® oneAPI DPC++/C++ Compiler cannot analyze them effectively, which might lead to suboptimal compiler output:
  - Non-constants in array indexes. For example, A[K + i], where i is the loop index variable and K is an unknown variable.
  - Multiple index variables in the same subscript location. For example, A[i + 2 * j], where i and j are the loop index variables of a doubly nested loop.

NOTE:

The Intel® oneAPI DPC++/C++ Compiler can analyze the array index A[i][j] effectively because the index variables are in different subscripts.
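As an illustration of the first guideline, the pointer-arithmetic loop can be rewritten with plain array indexing. The following is a minimal sketch with hypothetical function names. Both versions assume that `A` holds at least `N + 1` elements. The dependence between iterations remains, since each iteration reads the element that the previous one wrote, but the addresses are now simple functions of the loop index.

```cpp
// Pointer-arithmetic form from the guideline: copies A[k] into A[k + 1]
// while advancing the pointer.
void shift_with_pointer(int* A, int N) {
  for (int i = 0; i < N; i++) {
    int t = *(A++);
    *A = t;
  }
}

// Indexed form: the same reads and writes, expressed with the loop index.
void shift_with_index(int* A, int N) {
  for (int i = 0; i < N; i++) {
    A[i + 1] = A[i];
  }
}
```

Both functions propagate `A[0]` into `A[1]` through `A[N]`.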

Avoid Complex Loop Exit Conditions

The Intel® oneAPI DPC++/C++ Compiler evaluates exit conditions to determine if subsequent loop iterations can enter the loop pipeline. Occasionally, the Intel® oneAPI DPC++/C++ Compiler requires memory accesses or complex operations to evaluate the exit condition. In these cases, subsequent iterations cannot launch until the evaluation completes, decreasing the overall loop performance.
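A simple way to keep the exit condition cheap is to read any memory-resident bound once, before the loop, and compare against the local copy. The sketch below uses hypothetical function names and assumes that nothing modifies `*count` while the loop runs. If that assumption does not hold, the two versions are not equivalent.

```cpp
// Exit condition dereferences a pointer on every iteration.
int sum_reload_bound(const int* data, const int* count) {
  int sum = 0;
  for (int i = 0; i < *count; i++) {
    sum += data[i];
  }
  return sum;
}

// Bound is read once into a local; the exit test is a plain integer compare.
int sum_hoisted_bound(const int* data, const int* count) {
  const int n = *count;
  int sum = 0;
  for (int i = 0; i < n; i++) {
    sum += data[i];
  }
  return sum;
}
```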

Convert Nested Loops into a Single Loop

To maximize performance, combine nested loops into a single loop whenever possible. Restructuring nested loops into a single loop reduces the hardware footprint and the computational overhead between loop iterations.

The following code examples illustrate the conversion of a nested loop into a single loop:

Conversion of a Nested Loop into a Single Loop

Nested loop:

for (i = 0; i < N; i++) {
  //statements
  for (j = 0; j < M; j++) {
    //statements
  }
  //statements
}

Converted single loop:

for (i = 0; i < N*M; i++) {
  //statements
}
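When the loop body still needs the original row and column indexes, one option is to maintain them as counters inside the single loop. The following sketch uses hypothetical function names and covers only the innermost statements. Any work that ran before or after the inner loop would need to be guarded by a check such as `j == 0` or `j == M - 1`.

```cpp
// Nested form: visits every (i, j) pair and accumulates a value per pair.
long visit_nested(int N, int M) {
  long sum = 0;
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
      sum += i * 10 + j;
    }
  }
  return sum;
}

// Single-loop form: i and j are tracked by counters instead of loop nesting.
long visit_single(int N, int M) {
  long sum = 0;
  int i = 0;
  int j = 0;
  for (int k = 0; k < N * M; k++) {
    sum += i * 10 + j;
    j++;
    if (j == M) {  // end of a row: wrap j and advance i
      j = 0;
      i++;
    }
  }
  return sum;
}
```

The counters avoid a division and a modulo per iteration, which would otherwise be needed to recover `i` and `j` from `k`.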

Avoid Conditional Loops

To maximize performance, avoid declaring conditional loops. Conditional loops are tuples of loops that are declared within conditional statements such that one and only one of the loops is expected to be reached. These loops cannot be efficiently parallelized and result in a serialized implementation.

The following code examples illustrate the conversion of conditional loops to a more optimal implementation:

Conversion of a Conditional Loop to an Optimized Loop

Conditional loops:

if (condition) {
  for (int i = 0; i < m; i++) {
    // statements
  }
}
else {
  for (int i = 0; i < m; i++) {
    // statements
  }
}

Converted loop:

for (int i = 0; i < m; i++) {
  if (condition) {
    // statements
  }
  else {
    // statements
  }
}
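As a concrete instance of this conversion, the sketch below fills in the placeholder statements with two simple element updates. The function names and the `doubled` flag are hypothetical. The two versions compute the same result; the second one contains a single loop with the condition evaluated inside the body.

```cpp
// Conditional loops: exactly one of the two loops runs.
void update_conditional(int* data, int m, bool doubled) {
  if (doubled) {
    for (int i = 0; i < m; i++) {
      data[i] = data[i] * 2;
    }
  } else {
    for (int i = 0; i < m; i++) {
      data[i] = data[i] + 1;
    }
  }
}

// Converted loop: one loop, with the branch moved into the body.
void update_merged(int* data, int m, bool doubled) {
  for (int i = 0; i < m; i++) {
    if (doubled) {
      data[i] = data[i] * 2;
    } else {
      data[i] = data[i] + 1;
    }
  }
}
```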

Declare Variables in the Deepest Scope Possible

To reduce the hardware resources necessary for implementing a variable, declare the variable immediately before its use in a loop. Declaring variables in the deepest scope possible minimizes data dependencies and hardware use because the Intel® oneAPI DPC++/C++ Compiler does not need to preserve the variable data across loops that do not use the variable.

Consider the following example:


int a[N];
for (int i = 0; i < m; ++i) {
  int b[N];
  for (int j = 0; j < n; ++j) {
    // statements
  }
}

The array a requires more resources to implement than the array b because array a is declared at a broader scope, so the compiler must preserve its contents across iterations of the outer loop. To reduce hardware use, declare array a inside the outer loop (and outside the inner loop), unless the data must be maintained across iterations of the outer loop, as shown in the following:



for (int i = 0; i < m; ++i) {
  int a[N];
  int b[N];
  for (int j = 0; j < n; ++j) {
    // statements
  }
}

TIP:

Overwriting all values of a variable in the deepest scope possible also reduces the resources necessary to implement the variable.
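The following sketch combines both ideas under some assumptions: the row width `kCols`, the function name, and the computation are hypothetical. The `scratch` array is declared inside the outer loop and every element is written before it is read, so no value needs to survive from one outer iteration to the next.

```cpp
constexpr int kCols = 4;  // hypothetical fixed row width

// Computes the sum of squares of each row.
void row_square_sums(const int in[][kCols], int* out, int rows) {
  for (int i = 0; i < rows; ++i) {
    int scratch[kCols];  // deepest scope that still covers both inner loops
    for (int j = 0; j < kCols; ++j) {
      scratch[j] = in[i][j] * in[i][j];  // overwrite every element
    }
    int sum = 0;
    for (int j = 0; j < kCols; ++j) {
      sum += scratch[j];
    }
    out[i] = sum;
  }
}
```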
