Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

device_global Extension (Beta)

The device_globalclass introduces device-scoped memory allocations into SYCLthat can be accessed within a kernel using C++ global variable syntax. These memory allocations have unique instances per SYCL device. Like global variables, device_global are declared at namespace scope and visible to all kernels within that scope.

NOTICE:

In the current oneAPI release, the device_global class only supports a few features on FPGAs that are described in this topic. Subsequent oneAPI releases might include more features of this class. For detailed information about the device_global API, refer to the SYCL device_global Language Specification.

The device_global class is a class template parameterized by the underlying allocation type and a list of optional properties (see Table 1). The type of the allocation also encodes the size of the allocation for potentially multidimensional array types.

Declaring the device_global Class

The following is an example of the device_global class declaration that consists of an int array annotated with device_image_scope and host_access_none properties:

#include <sycl/sycl.hpp> 

namespace exp = sycl::ext::oneapi::experimental; 

using FPGAProperties = decltype(exp::properties( 
  exp::device_image_scope, exp::host_access_none));  
exp::device_global<int[10], FPGAProperties> val; 

In the above example, the device_image_scope property limits the lifetime of the val variable. The host_access_none property asserts that the host code does not copy to or from the device_global. These properties are further described in Table 1.

NOTE:

The current implementation does not support the device_global declaration without device_image_scope and host_access_none properties.

Accessing the device_global Class

The following is an example of accessing the device_global class:

int main () { 
  sycl::queue q; 
  q.single_task([=] { 
    val[0] = 42; 
   	... 
  }); 
}

In most cases, you can directly access the underlying data through overloads of common operators.

Properties of the device_global Class

The following table describes several compile-time-constant properties that the device_global class supports:

Properties of the device_global Class
Property Description
device_image_scope

Mandatory property for FPGAs.

If you set this property, the device_global memory lifetime begins when you program the FPGA binary image on an FPGA and ends when you program a different FPGA binary image on that FPGA.

host_access

Dictates how the host code accesses the device_global class and enables compiler optimizations.

Currently, it supports only the following value, with plans to support more values in future oneAPI releases:

  • none: Asserts that the host code never copies to or from the variable. For an FPGA device, no external ports are exposed.
NOTE:

For additional information, refer to the FPGA tutorial sample "Device Global" listed in the Intel® oneAPI Samples Browser on Linux* or Windows*, or access the code sample on GitHub.

Hardware Implementation

Global variables are not captured implicitly by the lambda, so the compiler need not pass device_global variables as kernel arguments to the generated IP core.

A device_global variable with a device_image_scope property is implemented in on-chip memory, which carries all the on-chip memory advantages, as described in Perform Kernel Computations Using Local or Private Memory.

If you use a compiler-generated IP core in your hardware system, the values stored in a device_global can only be reinitialized by reprogramming the device image on to the FPGA.

Use Cases

The following is one of the typical use cases of the device_global class:

Save State Across Re-entrant Kernels

A device_global class is helpful if your design executes a kernel multiple times and needs to save state across multiple invocations. One alternative is to save the state in global memory. However, this requires an expensive LSU to access that memory. Using the device_global class eliminates the need for that LSU, as the data is placed on-chip.

The following example illustrates this use case by executing code on every second invocation of the kernel:

#include <sycl/sycl.hpp>

namespace exp = sycl::ext::oneapi::experimental;

using FPGAProperties = decltype(exp::properties(
  exp::device_image_scope, exp::host_access_none));
exp::device_global<int, FPGAProperties> counter;

int main () {
  sycl::queue q;
  for (int i = 0; i < 10; i++) {
    q.single_task([=] {
      counter++;
      if (counter.get() % 2)
        // do something
    }).wait();  
  }
}
NOTE:

Scalar accesses require use of the get() method, while array accesses can use the [] operator.