3.8. Avoid Expensive Functions
The following functions are expensive:
- Integer division and modulo (remainder) operators
- Most floating-point operators except addition, multiplication, absolute value, and comparison
Note: For more information on optimizing floating-point operations, refer to the Optimize Floating-Point Operations section.
- Atomic functions
In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.
The following functions are inexpensive:
- Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
- Logical operations with one constant argument
- Shift by constant
- Integer multiplication and division by a constant that is a power of two
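The last three items in this list are closely related. For unsigned integers, division by a constant power of two is equivalent to a shift by a constant, and the matching modulo is equivalent to an AND with a constant mask. The following plain-C sketch illustrates these identities. The names BLOCK_SIZE, BLOCK_SHIFT, block_index, and block_offset are hypothetical and chosen only for this example. Compilers commonly apply this rewrite on their own for unsigned operands, so treat the sketch as an explanation of why such operations are inexpensive rather than as a required coding style.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical block size for illustration; it must be a power of two. */
#define BLOCK_SIZE  8u
#define BLOCK_SHIFT 3u   /* log2(BLOCK_SIZE) */

static_assert((BLOCK_SIZE & (BLOCK_SIZE - 1u)) == 0u,
              "BLOCK_SIZE must be a power of two");
static_assert((1u << BLOCK_SHIFT) == BLOCK_SIZE,
              "BLOCK_SHIFT must equal log2(BLOCK_SIZE)");

/* Same result as i / BLOCK_SIZE for unsigned i: a shift by a constant. */
static inline uint32_t block_index(uint32_t i)
{
    return i >> BLOCK_SHIFT;
}

/* Same result as i % BLOCK_SIZE for unsigned i: an AND with a constant mask. */
static inline uint32_t block_offset(uint32_t i)
{
    return i & (BLOCK_SIZE - 1u);
}
```

For example, element 29 falls in block 3 at offset 5, which matches 29 / 8 and 29 % 8. The identities hold for unsigned values only; signed division rounds toward zero, so a plain arithmetic shift gives a different result for negative operands.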
If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel. In contrast, the code example below shows an expensive floating-point operation (division) that every work-item in the NDRange executes, even though its result does not depend on the work-item:
__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c, const float d)
{
    size_t gid = get_global_id(0);

    // inefficient since each work-item must calculate c divided by d
    b[gid] = a[gid] * (c / d);
}
The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The modified code is shown below:
__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c_divided_by_d)
{
    size_t gid = get_global_id(0);

    /* host calculates c divided by d once and passes it into
       kernel to avoid redundant expensive calculations */
    b[gid] = a[gid] * c_divided_by_d;
}
The offline compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation and shares the result across all work-items. In the first code example, the offline compiler creates a single divider block shared by all work-items because the division of c by d remains constant across all work-items. This optimization helps minimize redundant hardware. However, even a single divider block requires a significant amount of hardware resources. Therefore, it is beneficial to offload the division to the host processor and pass the result to the kernel as an argument to conserve hardware resources.
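The host-side half of this change can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the host code of any particular application: precompute_scale and my_kernel_model are hypothetical names, and my_kernel_model is only a software reference model in which each loop iteration stands in for one work-item of the modified kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side step: compute the work-item-independent factor once. */
static float precompute_scale(float c, float d)
{
    assert(d != 0.0f);   /* assumption: the caller guarantees a nonzero divisor */
    return c / d;
}

/* Plain-C reference model of the modified kernel (hypothetical helper).
   Each iteration plays the role of one work-item; no division remains. */
static void my_kernel_model(const float *a, float *b,
                            float c_divided_by_d, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid) {
        b[gid] = a[gid] * c_divided_by_d;
    }
}
```

In an actual OpenCL host program, the value returned by precompute_scale would typically be handed to the kernel with clSetKernelArg, for instance clSetKernelArg(kernel, 2, sizeof(float), &scale) for the third argument of the modified kernel, where kernel and scale are assumed to be set up by the surrounding host code.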