Development Reference Guides

Contents

Programming with the Intel® oneAPI Level Zero Backend

This page shows the supported scenarios for multi-card and multi-tile programming with the Intel® oneAPI Level Zero (Level Zero) Backend.

Device Discovery

Root-devices
In this programming model, Intel GPUs are represented as SYCL GPU devices, or root-devices. You can find your root-device with the
sycl-ls
tool. For example:
sycl-ls
Example output:
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) UHD Graphics 630 [0x3e92] 3.0 [21.49.21786] [opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 2.1 [2020.11.11.0.03_160000] [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 630 [0x3e92] 1.2 [1.2.21786] [host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]
sycl-ls
shows the devices and platforms of all the SYCL backends, which are seen by the SYCL runtime. The example above shows the CPU (managed by an OpenCL™ backend) and two GPUs that correspond to the single physical GPU (managed by an OpenCL™ or Level Zero backend). There are two options to filter the observable root-devices:
Option One
Use the environment variable
SYCL_DEVICE_FILTER
, which is described in the Environment Variables. For example:
SYCL_DEVICE_FILTER=ext_oneapi_level_zero sycl-ls
Example output:
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 630 [0x3e92] 1.2 [1.2.21786]
Option Two
Use a similar API, as described in the Filter Selector, for example, the
filter_selector("ext_oneapi_level_zero")
only sees Level Zero operated devices.
If there are multiple GPUs in a system, they are seen as multiple root-devices. On Linux, you will see multiple SYCL root-devices of the same SYCL platform. On Windows, you will see root-devices of multiple different SYCL platforms.
You can use
CreateMultipleRootDevices=N NEOReadDebugKeys=1
environment variables to emulate multiple GPU cards. For example:
CreateMultipleRootDevices=2 NEOReadDebugKeys=1 SYCL_DEVICE_FILTER=ext_oneapi_level_zero sycl-ls
Example output:
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 630 [0x3e92] 1.2 [1.2.21786] [ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 630 [0x3e92] 1.2 [1.2.21786]
CreateMultipleRootDevices
is experimental, not validated, and is used for debug/experimental purposes only.
Sub-devices
Some Intel GPU hardware is composed of multiple tiles, where the root-devices can be partitioned into sub-devices that correspond to the physical tiles. For example:
try { vector<device> SubDevices = RootDevice.create_sub_devices< cl::sycl::info::partition_property::partition_by_affinity_domain>( cl::sycl::info::partition_affinity_domain::next_partitionable); }
Each call to
create_sub_devices
returns the same sub-devices in their persistent order. Use the
ZE_AFFINITY_MASK
environment variable to control what sub-devices are exposed by the Level Zero driver. The
partition_by_affinity_domain
is the only type of partitioning supported for Intel GPUs. The
next_partitionable
and
numa
properties are the only partitioning properties supported.
The
CreateMultipleSubDevices=N NEOReadDebugKeys=1
environment variables can be used to emulate multiple tiles of a GPU.
CreateMultipleSubDevices
is experimental, not validated, and is used for debug/experimental purposes only.

Contexts

Contexts are used for resource isolation and sharing. A SYCL context may consist of one or multiple devices. Both root-devices and sub-devices can be found within a single context, but they need to be from the same SYCL platform. A SYCL
kernel_bundle
created against a context with multiple devices is built to each of the root-devices in the context. For a context that consists of multiple sub-devices of the same root-device, only a single build (to that root-device) is needed.

Memory

Unified Shared Memory (USM)
There are multiple ways to allocate memory:
  • malloc_device
    :
    • Allocation can only be accessed by the specified device, but not by other devices in the context or by the host.
    • The data always stays on the device and is the fastest available for kernel execution.
    • Explicit copy is needed for transferring data to the host or other devices in the context.
  • malloc_host
    :
    • Allocation can be accessed by the host and any other device in the context.
    • The data always stays on the host and is accessed via Peripheral Component Interconnect (PCI) from the devices.
    • No explicit copy is needed for synchronizing of the data with the host or devices.
  • malloc_shared
    :
    • Allocation can only be accessed by the host and the specified device.
    • The data can migrate (operated by the Level Zero driver) between the host and the device for faster access.
    • No explicit copy is necessary for synchronizing between the host and the device, but it is needed for other devices in the context.
Memory allocated against a root-device is accessible by all of its sub-devices (tiles). If you are operating on a context with multiple sub-devices of the same root-device, then you can use
malloc_device
on that root-device instead of using the slower
malloc_host
. If you are using
malloc_device
you need an explicit copy out to the host to see the data located there.
Buffers
SYCL buffers that are created against a context and under the hood are mapped to the Level Zero USM allocation. The mapping details are:
  • Allocation on an integrated device is made on the host and is accessible by the host and the device without copying.
  • Memory buffers for context with sub-devices of the same root-device (possibly including the root-device itself) are allocated on that root-device. They are accessible by all the devices in the context. The synchronization with the host is performed by a SYCL runtime with map/unmap performing implicit copies when necessary.
  • Memory buffers for context with devices from different root-devices in it are allocated on host (and are accessible to all devices).

Queues

A SYCL queue is always attached to a single device in a potential multi-device context. The following example scenarios are listed from most to least performant:
Scenario One
Context with a single sub-device in it, where the queue is attached to that sub-device (tile):
  • The execution/visibility is limited to the single sub-device only.
  • This offers the best performance per tile.
For example:
try { vector<device> SubDevices = ...; for (auto &D : SubDevices) { // Each queue is in its own context, no data sharing across them. auto Q = queue(D); Q.submit([&](handler& cgh) {...}); } }
Scenario Two
Context with multiple sub-devices of the same root-device (multi-tile):
  • The queues are attached to the sub-devices, which implement explicit scaling.
  • The root-device should not be passed to this context for better performance.
For example:
try { vector<device> SubDevices = ...; auto C = context(SubDevices); for (auto &D : SubDevices) { // All queues share the same context, data can be shared across queues. auto Q = queue(C, D); Q.submit([&](handler& cgh) {...}); } }
Scenario Three
Context with a single root-device in it, where the queue is attached to that root-device:
  • The work is automatically distributed across all sub-devices/tiles via implicit scaling by the driver.
  • The simplest way to enable multi-tile hardware, but this does not offer possibility to target specific tiles.
For example:
try { // The queue is attached to the root-device, driver distributes to sub-devices, if any. auto D = device(gpu_selector{}); auto Q = queue(D); Q.submit([&](handler& cgh) {...}); }
Scenario Four
Contexts with multiple root-devices (multi-card):
  • The most unrestrictive context with queues attached to different root-devices.
  • Offers most sharing possibilities at the cost of slow access through host memory or explicit copies needed.
For example:
try { auto P = platform(gpu_selector{}); auto RootDevices = P.get_devices(); auto C = context(RootDevices); for (auto &D : RootDevices) { // Context has multiple root-devices, data can be shared across multi-card (requires explicit copying) auto Q = queue(C, D); Q.submit([&](handler& cgh) {...}); } }
Do not forget to allocate/synchronize your memory for your programming model and algorithm.

Multi-tile/card Examples

For your next steps, you can explore two examples of multi-tile and multi-card programming:

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.