Using the oneAPI Level Zero Interface

A Brief Introduction to the Level Zero API 

The oneAPI specification simplifies software development by providing a single language, a common set of APIs, and one programming model across accelerator architectures. It defines APIs for common data-parallel domains. Both the API-based and direct programming approaches are built on data parallelism (i.e., the same computation is performed on each data element). The oneAPI platform consists of a host and a collection of accelerator devices (Figure 1). The API programming model is implemented using oneMKL, oneDPL, oneDNN, oneCCL, and other libraries. Direct programming is done using DPC++.

Figure 1. The oneAPI platform model. Note that the Intel implementation of the oneAPI specification also contains programming tools.

Level Zero: Introduction

oneAPI Level Zero provides a low-level, direct-to-metal interface to the devices in a oneAPI platform. It supports broad language features along with fine-grained, explicit controls for device discovery, memory allocation, inter-process communication, kernel submission, synchronization, and metrics reporting. Its API exposes both the logical and physical abstractions of the underlying devices. While heavily influenced by other low-level APIs (e.g., the OpenCL™ API), Level Zero is designed to evolve independently. It supports GPUs and other compute devices, such as FPGAs. Most applications will not require the additional control that Level Zero offers; it is intended to supply the explicit controls needed by higher-level runtimes and libraries:

  • Device discovery and partitioning
  • Kernel execution and scheduling
  • Peer-to-peer communication
  • Metrics discovery and profiling
  • Kernel profiling, instrumentation
  • System management, query power, performance

The Level Zero C API is provided to applications by a shared import library, so C/C++ applications must include "ze_api.h" and link with "ze_api.lib" (or the corresponding shared library).
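As a minimal sketch of getting the headers and loader set up (the file name and build line below are our own assumptions and depend on your installation), initializing the driver is enough to verify the toolchain:

// A minimal "hello Level Zero" translation unit (a sketch). The build line is
// an assumption, e.g. on Linux:
//   g++ hello_ze.cpp -lze_loader
#include "ze_api.h"
#include <cstdio>

int main()
{
    // Initialize the driver(s); this must be called before any other Level Zero API.
    ze_result_t status = zeInit(ZE_INIT_FLAG_GPU_ONLY);
    std::printf("zeInit: %s\n",
                status == ZE_RESULT_SUCCESS ? "success" : "failure");
    return status == ZE_RESULT_SUCCESS ? 0 : 1;
}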

Level Zero: APIs

Level Zero APIs are categorized into Core, Tools, and System Programming; this article discusses only the Core Programming APIs (Figure 2). The Core API covers devices, drivers, contexts, memory, command queues/lists, synchronization, barriers, modules, and kernels. Tables 1 and 2 list the most commonly used APIs, and Figure 3 shows the execution flow using Level Zero.

Figure 2. Components of the Core Programming APIs

 

APIs for Device, Context, and Queue                | Short description
zeInit, zeDriverGet                                | Initialize the driver(s) and discover all driver instances
zeDeviceGet, zeDeviceGetProperties                 | Enumerate the devices of a driver and query their properties (e.g., to find a device of type GPU)
zeContextCreate                                    | Create a context
zeMemAllocHost, zeMemAllocDevice, zeMemAllocShared | Allocate host, device, or shared memory
zeCommandQueueCreate                               | Create a command queue
zeCommandListCreate                                | Create a command list
zeCommandQueueExecuteCommandLists                  | Execute command lists on a command queue
zeCommandQueueSynchronize                          | Synchronize the host with a command queue


Table 1. Level Zero APIs for Device, Context, and Queue
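Note that every Level Zero entry point returns a ze_result_t status code. The listings below omit error checking for brevity; a minimal checking wrapper (the ZE_CHECK name is ours, not part of the API) might look like this:

#include "ze_api.h"
#include <cstdio>
#include <cstdlib>

// Abort with a diagnostic if a Level Zero call does not return ZE_RESULT_SUCCESS.
#define ZE_CHECK(call)                                                    \
    do {                                                                  \
        ze_result_t result_ = (call);                                     \
        if (result_ != ZE_RESULT_SUCCESS) {                               \
            std::fprintf(stderr, "%s failed: 0x%x\n", #call, result_);    \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Usage example: ZE_CHECK(zeInit(ZE_INIT_FLAG_GPU_ONLY));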

Figure 3. High-level flow: command lists, queues, module, and kernel execution on a device

 

APIs for Synchronization, Modules, and Kernels | Short description
zeEventPoolCreate                              | Create a pool of events
zeCommandListAppendLaunchKernel                | Append a kernel launch, with its arguments and signal event, to a command list
zeEventHostSynchronize                         | Wait on the host for an event to be signaled
zeModuleCreate                                 | Create a module by compiling IL (e.g., SPIR-V) or loading a native binary
zeKernelCreate                                 | Create a handle to a kernel within a module
zeKernelSetArgumentValue                       | Set an argument for a kernel launch


Table 2. Level Zero APIs for synchronization and kernel-related functions

 

Listing 1 shows the main program, which performs driver and device discovery. Listing 2 shows kernel execution in the function RunTest.

// Driver code (main)
#include "ze_api.h"

void RunTest(ze_driver_handle_t &driverHandle, ze_device_handle_t *subDevice,
             ze_device_handle_t rootDevice, uint32_t subDeviceCount,
             bool &validRet); // defined in Listing 2

int main(int argc, char *argv[])
{
    zeInit(ZE_INIT_FLAG_GPU_ONLY);

    // Discover the first driver instance.
    uint32_t driverCount = 1;
    ze_driver_handle_t driverHandle;
    zeDriverGet(&driverCount, &driverHandle);

    // Discover the first device of that driver and query its properties.
    uint32_t deviceCount = 1;
    ze_device_handle_t device;
    zeDeviceGet(driverHandle, &deviceCount, &device);
    ze_device_properties_t deviceProperties = {};
    zeDeviceGetProperties(device, &deviceProperties);

    // Enumerate the sub-devices (e.g., the tiles of a multi-tile GPU).
    uint32_t subDeviceCount = 0;
    zeDeviceGetSubDevices(device, &subDeviceCount, nullptr);
    ze_device_handle_t subDevices[2] = {}; // this example assumes at most two tiles
    zeDeviceGetSubDevices(device, &subDeviceCount, subDevices);
    for (uint32_t i = 0; i < subDeviceCount; i++) {
        ze_device_properties_t subDeviceProperties = {};
        zeDeviceGetProperties(subDevices[i], &subDeviceProperties);
    }

    bool outputValidBool = false;
    RunTest(driverHandle, subDevices, device, subDeviceCount, outputValidBool);
    return 0;
}

Listing 1. Level Zero example for driver and device discovery
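Listing 1 queries the device properties but does not examine them. A short sketch of inspecting the returned structure follows (which fields to print is our own choice; see ze_device_properties_t in the specification for the full list):

// Inspect the properties returned by zeDeviceGetProperties (a sketch).
ze_device_properties_t props = {};
props.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
zeDeviceGetProperties(device, &props);
if (props.type == ZE_DEVICE_TYPE_GPU) {
    // name is a fixed-size character array; coreClockRate is in MHz.
    printf("GPU device: %s (clock %u MHz, %u slices)\n",
           props.name, props.coreClockRate, props.numSlices);
}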

void RunTest(ze_driver_handle_t &driverHandle, ze_device_handle_t *subDevice,
             ze_device_handle_t rootDevice, uint32_t subDeviceCount,
             bool &validRet)
{
    // Variable initialization, host memory allocation
    ...
    // Create a context, plus a command queue and command list per tile.
    zeContextCreate(driverHandle, &contextDesc, &context);
    ...
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandQueueCreate(context, subDevice[i], &cmdQueueDesc, &cmdQueue[i]);
        zeCommandListCreate(context, subDevice[i], &cmdListDesc, &cmdList[i]);
    }
    ...
    // Load the IL file (SPIR-V format) containing the kernels to run on the device.
    const char *modulePath = "Gpu_Module_Kernel.spv";
    uint32_t spirvSize = 0;
    auto spirvModule = readBinaryFile(modulePath, spirvSize);
    ...
    // Build the module and create a kernel for each tile.
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeModuleCreate(context, subDevice[i], &moduleDesc, &module[i], nullptr);
        zeKernelCreate(module[i], &kernelDesc, &kernel[i]);
        zeKernelSetGroupSize(kernel[i], groupSizeX, groupSizeY, groupSizeZ);
    }
    ...
    // Allocate device memory and append memory-copy commands to each command list.
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeMemAllocDevice(context, &deviceDesc, bufferWidth * sizeof(float), 0,
                         subDevice[i], &d_input[i]);
        zeCommandListAppendMemoryCopy(cmdList[i], d_input[i], input[i],
                                      bufferWidth * sizeof(float), nullptr, 0, nullptr);
        ...
    }
    // Copy data from host to device (execute the queued allocation/copy commands).
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandListClose(cmdList[i]);
        zeCommandQueueExecuteCommandLists(cmdQueue[i], 1, &cmdList[i], nullptr);
    }
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandQueueSynchronize(cmdQueue[i], UINT64_MAX); // wait indefinitely
    }
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandListReset(cmdList[i]);
    }
    ...
    // Set the kernel arguments.
    for (uint32_t i = 0; i < num_tiles; i++) {
        arg_indx = 0;
        start_idx = i * segment_size;

        zeKernelSetArgumentValue(kernel[i], arg_indx++, sizeof(d_input[i]),
                                 &d_input[i]);
        // The second argument is the next tile's buffer (wrapping around to 0).
        zeKernelSetArgumentValue(kernel[i], arg_indx++, sizeof(d_input[i]),
                                 &d_input[i + 1 == num_tiles ? 0 : i + 1]);
        ...
    }
    // Create an event pool (and a timestamp event per tile), then append each
    // kernel launch, with its signal event, to the tile's command list.
    ze_event_pool_handle_t eventPool;
    ...
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandListAppendLaunchKernel(cmdList[i], kernel[i], &group_count,
                                        kernelTsEvent[i], 0, nullptr);
    }

    // Execute the command lists and synchronize with the queues.
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandListClose(cmdList[i]);
        zeCommandQueueExecuteCommandLists(cmdQueue[i], 1, &cmdList[i], nullptr);
    }
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandQueueSynchronize(cmdQueue[i], UINT64_MAX);
    }
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandListReset(cmdList[i]);
    }
    // Query the kernel timestamp events and compute each execution duration.
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeEventQueryKernelTimestamp(kernelTsEvent[i], &kernelTsResults);
        uint64_t kernelDuration = kernelTsResults.context.kernelEnd -
                                  kernelTsResults.context.kernelStart;
    }
    // Copy data from device to host.
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeCommandListAppendMemoryCopy(cmdList[i], output[i], d_output[i],
                                      bufferWidth * sizeof(float), nullptr, 0, nullptr);
    }
    ...
    // Tear down: destroy the kernels, memory, command lists, events, and context.
    zeEventPoolDestroy(eventPool);
    for (uint32_t i = 0; i < num_tiles; i++) {
        zeMemFree(context, d_input[i]);
        zeKernelDestroy(kernel[i]);
        zeCommandListDestroy(cmdList[i]);
        ...
    }
    zeContextDestroy(context);
    ...
}

Listing 2. Level Zero example for kernel execution
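Listing 2 elides the creation of the timestamp events used with zeCommandListAppendLaunchKernel and zeEventQueryKernelTimestamp. A minimal sketch of that step follows (the descriptor settings shown are typical choices, not requirements):

// Create one pool with a timestamp event per tile (a sketch of the elided step).
ze_event_pool_desc_t poolDesc = {};
poolDesc.stype = ZE_STRUCTURE_TYPE_EVENT_POOL_DESC;
poolDesc.flags = ZE_EVENT_POOL_FLAG_HOST_VISIBLE | ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP;
poolDesc.count = num_tiles;
zeEventPoolCreate(context, &poolDesc, 0, nullptr, &eventPool);

for (uint32_t i = 0; i < num_tiles; i++) {
    ze_event_desc_t eventDesc = {};
    eventDesc.stype = ZE_STRUCTURE_TYPE_EVENT_DESC;
    eventDesc.index = i;                         // slot within the pool
    eventDesc.signal = ZE_EVENT_SCOPE_FLAG_HOST; // make the signal host-visible
    zeEventCreate(eventPool, &eventDesc, &kernelTsEvent[i]);
}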

OpenMP* Example

Level Zero API calls are also generated in the backend when compiling OpenMP* offload code. These calls are logged when the environment variables LIBOMPTARGET_DEBUG and LIBOMPTARGET_INFO are set to 1 or higher. Listing 3 shows an example in which an AoS (array of structures) is allocated on the device; data initialized on the host is copied to the device allocation, updated there, and transferred back to the host. The logs were generated by setting both environment variables to 99.

Listing 4 shows some of the Level Zero API calls that appear in the logs. The zeMemAllocDevice call allocates memory on the device, followed by a zeCommandListAppendMemoryCopy call that copies data from the host to the device. Once the kernel computation finishes, a second zeCommandListAppendMemoryCopy call copies the updated data from the device back to the host. The AoS is then deleted with a call to zeMemFree, after which the device memory is returned to the memory pool. These calls are similar to those in Listing 2. We have focused only on the Level Zero calls that correlate directly to the OpenMP pragmas; the logs show many more calls used to copy pointers from host to device, query memory block properties, map host pointers to device pointers, and clean up the device when done. Beyond these, the Level Zero API provides additional calls that give more control over how memory is allocated, copied between host and device, or shared between the two.
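For example, shared allocations are made with zeMemAllocShared. The following sketch (descriptor settings are minimal defaults; context, device, and bufferWidth are assumed from Listing 2) allocates a buffer accessible from both host and device:

ze_device_mem_alloc_desc_t deviceDesc = {};
deviceDesc.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC;
ze_host_mem_alloc_desc_t hostDesc = {};
hostDesc.stype = ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC;

void *shared = nullptr;
// Shared (unified) memory is accessible from host and device without
// explicit zeCommandListAppendMemoryCopy calls.
zeMemAllocShared(context, &deviceDesc, &hostDesc,
                 bufferWidth * sizeof(float), 0 /*alignment*/, device, &shared);
...
zeMemFree(context, shared);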

#include <stdio.h>
#include <stdlib.h>

struct force_data
{
    float Mass;
    int index;
};

#pragma omp declare target
struct force_data *myData;
#pragma omp end declare target

int main()
{
    int a[10], max = 10;
    // Allocate the array of structures on the host.
    myData = (struct force_data*) malloc(max * sizeof(struct force_data));
    for (int i = 0; i < max; i++) {
        myData[i].index = 1;
        a[i] = 2;
    }

    // 1. Allocate data on the device.
    #pragma omp target enter data map(alloc:myData[0:max])

    // 2. Update data on the device.
    #pragma omp target teams distribute parallel for map(to:a)
    for (int i = 0; i < max; i++)
        myData[i].index = myData[i].index + a[i];

    // 3. Copy the updated data back to the host (this transfer appears in the logs).
    #pragma omp target update from(myData[0:max])

    // 4. Delete data on the device.
    #pragma omp target exit data map(delete:myData[0:max])

    for (int i = 0; i < max; i++)
        printf("%d\n", myData[i].index);
}

Listing 3. OpenMP offload example

Libomptarget (pid:6780) --> Entering OpenMP data region at unknown:31:31 with 1 arguments:
Libomptarget (pid:6780) --> alloc(myData[0:10])[80]
…
Target LEVEL0 RTL (pid:10428) --> ZE_CALLER: zeMemAllocDevice ( context, &deviceDesc,
Size, Align, Device, &mem )
…
Libomptarget (pid:10428) --> Copying data from host to device, HstPtr=…., TgtPtr=….,
Size=80, Name=myData[0:10]
…
Target LEVEL0 RTL (pid:10428) --> Copy Engine is used for data transfer
Target LEVEL0 RTL (pid:10428) --> ZE_CALLER: zeCommandListAppendMemoryCopy ( cmdList,
Dest, Src, Size, nullptr, 0, nullptr )
…
Libomptarget (pid:6780) --> Entering OpenMP kernel at unknown:41:41 with 6 arguments:
Libomptarget (pid:6780) --> tofrom(myData)[8] (implicit)
…
Libomptarget (pid:6780) --> Updating OpenMP data at unknown:46:46 with 1 arguments:
Libomptarget (pid:6780) --> from(myData[0:10])[80]
Libomptarget (pid:6780) --> Copying data from device to host, TgtPtr=0xffffd556aa640000,
HstPtr=0x0000000000d362b0, Size=80, Name=myData[0:10]
Target LEVEL0 RTL (pid:6780) --> Copy Engine is used for data transfer
Target LEVEL0 RTL (pid:6780) --> ZE_CALLER: zeCommandListAppendMemoryCopy ( cmdList, Dest,
Src, Size, nullptr, 0, nullptr )
Libomptarget (pid:6780) --> Deleting tgt data 0xffffd556aa640000 of size 80
Target LEVEL0 RTL (pid:6780) --> Returned device memory 0xffffd556aa640000 to memory pool
Target LEVEL0 RTL (pid:6780) --> ZE_CALLER: zeMemFree ( Context, (void *)block->Base )
Target LEVEL0 RTL (pid:6780) --> ZE_CALLEE: zeMemFree (
Target LEVEL0 RTL (pid:6780) --> hContext = 0x0000000000d33500
Target LEVEL0 RTL (pid:6780) --> ptr = 0xffffd556aa640000

Listing 4. Level Zero API calls generated in the backend for OpenMP offload

Conclusions

Most application developers will not require the additional control provided by the Level Zero API; it is intended mainly for library and framework developers. Level Zero provides fine-grained, explicit control over device discovery, memory management, kernel submission, inter-process communication, and more. In this article, we looked at a basic example to become familiar with Level Zero programming. The OpenMP offload example also gave some insight into the calls generated in the backend that provide a direct-to-metal interface to the offload accelerator device. The oneAPI Level Zero specification contains complete API details.