ID 683176
Date 9/24/2018
Public

## 3.8. Avoid Expensive Functions

Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.

The following functions are expensive:

• Integer division and modulo (remainder) operators
• Most floating-point operators except addition, multiplication, absolute value, and comparison
Note: For more information on optimizing floating-point operations, refer to the Optimize Floating-Point Operations section.
• Atomic functions

In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.

The following functions are inexpensive:

• Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
• Logical operations with one constant argument
• Shift by constant
• Integer multiplication and division by a constant that is a power of two

If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel. On the contrary, the code example below shows a case of an expensive floating-point operation (division) executed by every work-item in the NDRange:

__kernel void myKernel (__global const float * restrict a,
__global float * restrict b,
const float c, const float d)
{
size_t gid = get_global_id(0);

//inefficient since each work-item must calculate c divided by d
b[gid] = a[gid] * (c / d);
}


The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The modified code is shown below:

__kernel void myKernel (__global const float * restrict a,
__global float * restrict b,
const float c_divided_by_d)
{
size_t gid = get_global_id(0);

/*host calculates c divided by d once and passes it into
kernel to avoid redundant expensive calculations*/
b[gid] = a[gid] * c_divided_by_d;
}

The consolidates operations that are not work-item-dependent across the entire NDRange into a single operation. It then shares the result across all work-items. In the first code example, the offline compiler creates a single divider block shared by all work-items because division of c by d remains constant across all work-items. This optimization helps minimize the amount of redundant hardware. However, the implementation of an integer division requires a significant amount of hardware resources. Therefore, it is beneficial to off-load the division operation to the host processor and then pass the result as an argument to the kernel to conserve hardware resources.