Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018
Public
Document Table of Contents

3.8. Avoid Expensive Functions

Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.

The following functions are expensive:

  • Integer division and modulo (remainder) operators
  • Most floating-point operators except addition, multiplication, absolute value, and comparison
    Note: For more information on optimizing floating-point operations, refer to the Optimize Floating-Point Operations section.
  • Atomic functions

In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.

The following functions are inexpensive:

  • Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
  • Logical operations with one constant argument
  • Shift by constant
  • Integer multiplication and division by a constant that is a power of two

If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel. On the contrary, the code example below shows a case of an expensive floating-point operation (division) executed by every work-item in the NDRange:

__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c, const float d)
{
   size_t gid = get_global_id(0);
   
   //inefficient since each work-item must calculate c divided by d
   b[gid] = a[gid] * (c / d); 
}

The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The modified code is shown below:

__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c_divided_by_d)
{
   size_t gid = get_global_id(0);

   /*host calculates c divided by d once and passes it into  
   kernel to avoid redundant expensive calculations*/   
   b[gid] = a[gid] * c_divided_by_d;  						
}

The consolidates operations that are not work-item-dependent across the entire NDRange into a single operation. It then shares the result across all work-items. In the first code example, the offline compiler creates a single divider block shared by all work-items because division of c by d remains constant across all work-items. This optimization helps minimize the amount of redundant hardware. However, the implementation of an integer division requires a significant amount of hardware resources. Therefore, it is beneficial to off-load the division operation to the host processor and then pass the result as an argument to the kernel to conserve hardware resources.

Did you find the information on this page useful?

Characters remaining:

Feedback Message