Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public

Avoid Expensive Functions

Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.

The following functions are expensive:

  • Integer division and modulo (remainder) operators
  • Most floating-point operators except addition, multiplication, absolute value, and comparison. For more information about optimizing floating-point operations, refer to the Optimize Floating-point Operation section.
  • Atomic operations. For more information, refer to the Memory Model and Atomics chapter in the Data Parallel C++ book and the Atomic Operations topic in the SYCL™ 2020 specification.
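
As an illustration of the first item, a wrap-around index is often written with the modulo operator, which implies a divider when the bound is not a constant power of two. A common alternative, sketched here in plain C++, replaces the modulo with a comparison and a reset. The function names are hypothetical and not part of any oneAPI API, and whether the rewrite saves area in a given design depends on the compiler and the surrounding code.

```cpp
#include <cstdint>

// Hypothetical helper: advance a wrap-around index with modulo.
// For a bound n that is not a constant power of two, this implies a divider.
inline std::uint32_t next_index_modulo(std::uint32_t i, std::uint32_t n) {
    return (i + 1) % n;
}

// Hypothetical helper: same result for 0 <= i < n, using only an
// increment, an equality comparison, and a select.
inline std::uint32_t next_index_compare(std::uint32_t i, std::uint32_t n) {
    const std::uint32_t next = i + 1;
    return (next == n) ? 0u : next;
}
```

The two helpers agree only when the incoming index is already in the range 0 to n - 1, which is the usual case for a loop counter or circular-buffer pointer.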

In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.

The following functions are inexpensive:

  • Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
  • Logical operations with one constant argument
  • Shift by constant
  • Integer multiplication and division by a constant that is a power of two
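
The last two items are related: on unsigned integers, multiplying or dividing by a constant power of two is equivalent to a shift by a constant. The following plain C++ sketch shows the equivalence for the constant 8. The function names are hypothetical, and the remainder-by-mask case is an extension of the list above, not an item in it.

```cpp
#include <cstdint>

// Division of an unsigned value by 8 (2^3) is a right shift by a constant.
inline std::uint32_t div_by_8(std::uint32_t x) { return x >> 3; }

// Multiplication by 8 is a left shift by a constant
// (high bits are discarded on overflow, as with the * operator).
inline std::uint32_t mul_by_8(std::uint32_t x) { return x << 3; }

// Remainder modulo 8 of an unsigned value keeps the low three bits.
inline std::uint32_t mod_by_8(std::uint32_t x) { return x & 7u; }
```

Compilers typically perform this rewrite automatically when the divisor is a compile-time constant, so the practical point is to keep such divisors constant and, where the design allows, a power of two.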

If an expensive function produces a new piece of data for every work item in a work group, it is beneficial to code it in a kernel.

In contrast, the following code example shows an expensive floating-point operation (division) that every work item in the NDRange executes:

// this function is used in kernel code
void myKernel (accessor<int, access::mode::read, access::target::global_buffer> a,
// b is written by the kernel, so it needs write access
accessor<int, access::mode::write, access::target::global_buffer> b, sycl::id<1> wiID,
const float c,
const float d)
{
   // inefficient since each work item must calculate c divided by d
   b[wiID] = a[wiID] * (c / d);
}

The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work items in the NDRange to use. The modified code is shown in the following:

void myKernel (accessor<int, access::mode::read, access::target::global_buffer> a,
// b is written by the kernel, so it needs write access
accessor<int, access::mode::write, access::target::global_buffer> b, sycl::id<1> wiID, const float c_divided_by_d)
{
   /* host calculates c divided by d once and passes it into
   kernel to avoid redundant expensive calculations */
   b[wiID] = a[wiID] * c_divided_by_d;
}
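
The host side of this change is not shown above. A minimal sketch of the idea in plain C++, with an ordinary loop standing in for the NDRange and no SYCL runtime involved, might look like the following. The names scale_element and run_on_host are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the kernel body: one multiply per element, no division.
inline int scale_element(int a, float c_divided_by_d) {
    return static_cast<int>(a * c_divided_by_d);
}

// Stand-in for the host code: divide once, then reuse the quotient.
inline std::vector<int> run_on_host(const std::vector<int>& a, float c, float d) {
    const float c_divided_by_d = c / d;  // computed once, outside the "kernel"
    std::vector<int> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {  // stands in for the NDRange
        b[i] = scale_element(a[i], c_divided_by_d);
    }
    return b;
}
```

In a real SYCL application, the quotient would be computed before the kernel is submitted and captured by the kernel lambda or passed as a function argument, as in the modified myKernel.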

The Intel® oneAPI DPC++/C++ Compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation and shares the result across all work items. In the first code example, the compiler creates a single divider block shared by all work items because the division of c by d remains constant across all work items. This optimization helps minimize the amount of redundant hardware.

However, even a single divider requires a significant amount of hardware resources. In this case, it is beneficial to offload the division to the host processor and pass the result to the kernel as an argument to conserve hardware resources.