Get Started

  • 2021.4
  • 09/27/2021

Get Started with OpenMP* Offload to GPU for the Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler (Beta)

The OpenMP* Offload to GPU feature of the Intel® oneAPI DPC++/C++ Compiler and the Intel® Fortran Compiler (Beta) compiles OpenMP source files for a wide range of accelerators. Only the icx and ifx compilers support the OpenMP Offload feature.

Before You Begin

Visit the Release Notes page for known issues and the most up-to-date information.

OpenMP 5.0/5.1 Support in icx and ifx

The icx and ifx compilers support most of OpenMP 5.0 and 5.1, including device support for GPU targets. For more details, go to the OpenMP Support section of the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference.

The following OpenMP directives are supported in the icx and ifx compilers for GPU and CPU. Equivalent keywords for C++ and Fortran are provided in the list below; a minimal usage sketch follows the list.
  • allocate
  • atomic
  • barrier
  • critical
  • declare simd
  • declare target
  • declare variant
  • dispatch
  • distribute
  • distribute parallel for/do
  • distribute parallel for/do simd
  • distribute simd
  • for/do
  • for/do simd
  • loop
  • master
  • parallel
  • parallel for/do
  • parallel for/do simd
  • parallel loop
  • parallel sections
  • sections
  • simd
  • single
  • target
  • target data
  • target enter data
  • target exit data
  • target parallel
  • target parallel for/do
  • target parallel for/do simd
  • target parallel loop
  • target simd
  • target teams
  • target teams distribute
  • target teams distribute parallel for/do
  • target teams distribute parallel for/do simd
  • target teams distribute simd
  • target teams loop
  • target variant dispatch (Intel extension)
  • target update
  • teams
  • teams distribute
  • teams distribute parallel for/do
  • teams distribute parallel for/do simd
  • teams distribute simd
  • teams loop
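The following is a minimal C++ sketch (not taken from the product documentation) that combines a few of the listed directives; the file name, the saxpy-style loop, and the array names x, y, and size N are illustrative assumptions only.

// saxpy_offload.cpp: minimal sketch combining "target teams distribute parallel for"
// with explicit map clauses; build and run it the same way as the matmul example below.
#include <cstdio>

constexpr int N = 1024;

void saxpy(float a, const float *x, float *y) {
  // Copy x to the device, copy y both ways, and parallelize the loop across teams and threads.
  #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
  for (int i = 0; i < N; i++)
    y[i] = a * x[i] + y[i];
}

int main() {
  static float x[N], y[N];
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
  saxpy(3.0f, x, y);
  std::printf("y[0] = %f\n", y[0]);  // expected: 5.000000
  return 0;
}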
GPU runtime routines:
The following routines are callable from the CPU to query device information and to allocate and manage device memory on the GPU; a minimal sketch follows the list.
  • omp_get_initial_device
  • omp_get_interop_int
  • omp_get_interop_name
  • omp_get_interop_ptr
  • omp_get_interop_rc_desc
  • omp_get_interop_str
  • omp_get_interop_type_desc
  • omp_get_mapped_ptr
  • omp_get_num_devices
  • omp_get_num_interop_properties
  • omp_is_initial_device
  • omp_target_alloc
  • omp_target_alloc_device (Intel extension)
  • omp_target_alloc_host (Intel extension)
  • omp_target_alloc_shared (Intel extension)
  • omp_target_associate_ptr
  • omp_target_disassociate_ptr
  • omp_target_free
  • omp_target_is_accessible
  • omp_target_is_present
  • omp_target_memcpy
  • omp_target_memcpy_rect
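Below is a minimal sketch (an illustration, not from the original document) that uses several of the routines above to allocate device memory from the host, copy a buffer to and from the device, and free it; device number 0 and the buffer size are assumptions for this example.

// device_alloc.cpp: minimal sketch of host-callable device memory routines.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  const size_t n = 256, bytes = n * sizeof(int);
  int *host_buf = (int *)malloc(bytes);
  for (size_t i = 0; i < n; i++) host_buf[i] = (int)i;

  int host = omp_get_initial_device();
  int dev  = 0;                                 // first offload device (assumed available)
  if (omp_get_num_devices() == 0) dev = host;   // fall back to the host device

  void *dev_buf = omp_target_alloc(bytes, dev);                   // allocate on the device
  omp_target_memcpy(dev_buf, host_buf, bytes, 0, 0, dev, host);   // host -> device
  // ... a target region could operate on dev_buf via is_device_ptr(dev_buf) ...
  omp_target_memcpy(host_buf, dev_buf, bytes, 0, 0, host, dev);   // device -> host
  omp_target_free(dev_buf, dev);

  printf("devices: %d, element 10 after round trip: %d\n",
         omp_get_num_devices(), host_buf[10]);
  free(host_buf);
  return 0;
}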
Device runtime routines callable on the GPU (a minimal sketch follows the list):
  • omp_get_device_num
  • omp_get_max_threads
  • omp_get_num_devices
  • omp_get_num_procs
  • omp_get_num_teams
  • omp_get_num_threads
  • omp_get_team_num
  • omp_get_team_size
  • omp_get_thread_limit
  • omp_get_thread_num
  • omp_in_parallel
  • omp_is_initial_device
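The sketch below (an illustration, not from the original document) calls a few of these routines from inside a target region; the team and thread counts are illustrative, and output ordering is not deterministic.

// device_query.cpp: minimal sketch of device-callable runtime routines.
#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp target teams num_teams(2) thread_limit(4)
  #pragma omp parallel
  {
    // printf is the only C/C++ library routine supported on the GPU (see the restrictions below).
    if (omp_get_thread_num() == 0)
      printf("team %d of %d, %d threads, running on initial device: %d\n",
             omp_get_team_num(), omp_get_num_teams(),
             omp_get_num_threads(), omp_is_initial_device());
  }
  return 0;
}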
Environment variables (a usage sketch follows the list):
  • OMP_DEFAULT_DEVICE: Sets the default device
  • OMP_TARGET_OFFLOAD: Controls whether offload runs on the device or falls back to the host
  • LIBOMPTARGET_PLUGIN: Chooses the OpenCL™ or Level Zero (Level0) plugin
  • LIBOMPTARGET_DEBUG: Displays debug information
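The following shell sketch (an illustration; the program name is taken from the matmul example later in this document, and the accepted plugin values, shown here as LEVEL0 and OPENCL, should be confirmed in the compiler documentation) sets these variables before running an offload binary:

$ export OMP_DEFAULT_DEVICE=0           # use the first offload device
$ export OMP_TARGET_OFFLOAD=MANDATORY   # fail instead of falling back to the host
$ export LIBOMPTARGET_PLUGIN=LEVEL0     # or OPENCL
$ export LIBOMPTARGET_DEBUG=1           # dump offload runtime debug information
$ ./matmul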

Options Support for Targeting Gen9

Intel supports two options that enable OpenMP and offload execution on the CPU and GPU:
  • -qopenmp
  • -fopenmp-targets=spir64
The -qopenmp option enables a middle end that supports the transformation of OpenMP in LLVM* (but not in the Clang* front end). The -fopenmp-targets=spir64 option enables the compiler to generate an x86 + SPIR64 fat binary that contains the GPU device binary.

Restrictions on Gen9 or Later Target Regions

Given that OpenMP offloading is built on top of an OpenCL runtime stack for GPU, the restrictions that apply to OpenCL kernel functions also apply to OpenMP offload region code. Below is the list of restrictions (a short sketch after the list shows a function-call pattern that is allowed):
  • Recursive function calls (unless a compile-time constant expression)
  • Non-placement new and delete
  • goto statements
  • register and thread_local storage qualifiers
  • Virtual function qualifiers
  • Function pointers (unless a compile-time constant expression)
  • Virtual functions
  • Exception handling
  • C++ standard library (for example, only printf is supported on the GPU)
  • Implicit lambda-to-function-pointer conversion
  • Variadic functions
  • Variable length arrays (VLAs); also not supported for the tasking model and async offloading
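As a contrast to the restrictions above, the sketch below (illustrative only, not from the original document) calls an ordinary non-recursive, non-virtual function from an offload region after making it available on the device with declare target; the loop bound and reduction are assumptions for this example.

// declare_target.cpp: minimal sketch of a device-callable helper function.
#include <stdio.h>

#pragma omp declare target
static inline int square(int v) { return v * v; }  // plain, statically resolvable call
#pragma omp end declare target

int main() {
  int sum = 0;
  #pragma omp target teams distribute parallel for reduction(+: sum) map(tofrom: sum)
  for (int i = 0; i < 100; i++)
    sum += square(i);
  printf("sum = %d\n", sum);  // expected: 328350
  return 0;
}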

Example for OpenMP Offloading

The following simple matrix multiplication example uses the OpenMP combined construct target teams distribute parallel for:
// matmul.cpp: Matrix Multiplication Example using OpenMP Offloading
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define MAX 128
int A[MAX][MAX], B[MAX][MAX], C[MAX][MAX], C_SERIAL[MAX][MAX];

typedef int BOOL;
typedef int TYPE;

BOOL check_result(TYPE *actual, TYPE *expected, unsigned n) {
  for (unsigned i = 0; i < n; i++) {
    if (actual[i] != expected[i]) {
      printf("Value mismatch at index = %u. Expected: %d, Actual: %d.\n",
             i, expected[i], actual[i]);
      return 0;
    }
  }
  return 1;
}

void __attribute__((noinline)) Compute() {
  // The loop nest directly follows the combined construct.
  #pragma omp target teams distribute parallel for map(to: A, B) map(tofrom: C) \
          thread_limit(128)
  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++)
      for (int k = 0; k < MAX; k++)
        C[i][j] += A[i][k] * B[k][j];
}

int main() {
  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++) {
      A[i][j] = i + j - 1;
      B[i][j] = i - j + 1;
    }

  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++)
      for (int k = 0; k < MAX; k++)
        C_SERIAL[i][j] += A[i][k] * B[k][j];

  Compute();

  if (!check_result((int *)&C[0][0], (int *)&C_SERIAL[0][0], MAX * MAX)) {
    printf("FAILED\n");
    return 1;
  }
  printf("PASSED\n");
  return 0;
}

Compilation and Run Commands

On Linux*, GCC* 4.8.5 or higher must be installed for host code compilation. This is to avoid any incompatibilities due to a changed C++ Application Binary Interface (ABI).
  1. Compile: Compile the source code with the icx, icpx, or ifx compiler driver and enable GPU offloading:
     $ icx -qopenmp -fopenmp-targets=spir64 matmul_offload.c -o matmul
     OR
     $ icpx -qopenmp -fopenmp-targets=spir64 matmul_offload.cpp -o matmul
     OR
     $ ifx -qopenmp -fopenmp-targets=spir64 matmul_offload.f90 -o matmul
  2. Run: Set the OMP_TARGET_OFFLOAD environment variable to force offloading (or fail if offload is not possible):
     $ export OMP_TARGET_OFFLOAD=MANDATORY
     The default is to fall back to the host if execution cannot run on the GPU. Example run:
     $ ./matmul
     PASSED

Enhanced Compiler Integration of GPU-Optimized LIBM Functions

In some cases, there are multiple variants of a given math function with differing accuracy/performance tradeoffs. The compiler provides a way to choose an appropriate variant based on your compiler options. The OpenMP Offload to GPU feature of the Intel® oneAPI DPC++/C++ Compiler and the Intel® Fortran Compiler (Beta) supports all fp-model settings; the fp-model behavior supported by the Intel® C++ Compiler is carried over as well. Below is the list of math functions that are supported for Gen9 or later, based on OpenCL built-in math functions; a short usage sketch follows the list.
std::unordered_map<std::string, std::string> llvm::vpo::OCLBuiltin = {
  // float:
  {"sinf", "_Z3sinf"},    {"cosf", "_Z3cosf"},    {"tanf", "_Z3tanf"},
  {"erff", "_Z3erff"},    {"expf", "_Z3expf"},    {"logf", "_Z3logf"},
  {"log2f", "_Z4log2f"},  {"powf", "_Z3powff"},   {"sqrtf", "_Z4sqrtf"},
  {"fmaxf", "_Z4fmaxff"}, {"llvm.maxnum.f32", "_Z4fmaxff"},
  {"fminf", "_Z4fminff"}, {"llvm.minnum.f32", "_Z4fminff"},
  {"fabsf", "_Z4fabsf"},  {"llvm.fabs.f32", "_Z4fabsf"},
  {"ceilf", "_Z4ceilf"},  {"llvm.ceil.f32", "_Z4ceilf"},
  {"floorf", "_Z5floorf"}, {"llvm.floor.f32", "_Z5floorf"},
  // double:
  {"sin", "_Z3sind"},    {"cos", "_Z3cosd"},    {"tan", "_Z3tand"},
  {"erf", "_Z3erfd"},    {"exp", "_Z3expd"},    {"log", "_Z3logd"},
  {"log2", "_Z4log2d"},  {"pow", "_Z3powdd"},   {"sqrt", "_Z4sqrtd"},
  {"fmax", "_Z4fmaxdd"}, {"llvm.maxnum.f64", "_Z4fmaxdd"},
  {"fmin", "_Z4fmindd"}, {"llvm.minnum.f64", "_Z4fmindd"},
  {"fabs", "_Z4fabsd"},  {"llvm.fabs.f64", "_Z4fabsd"},
  {"ceil", "_Z4ceild"},  {"llvm.ceil.f64", "_Z4ceild"},
  {"floor", "_Z5floord"}, {"llvm.floor.f64", "_Z5floord"},
  {"invsqrtf", "_Z5rsqrtf"}, {"invsqrt", "_Z5rsqrtd"}};
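The sketch below (illustrative only, not from the original document) calls a few of the mapped functions inside an offload region; the file name, array size, and the particular expression are assumptions for this example.

// libm_offload.cpp: math calls such as sqrtf, expf, and fmaxf inside a target
// region are resolved to the corresponding OpenCL built-ins listed above.
#include <math.h>
#include <stdio.h>
#define N 1024

int main() {
  static float in[N], out[N];
  for (int i = 0; i < N; i++) in[i] = (float)i / N;

  #pragma omp target teams distribute parallel for map(to: in) map(from: out)
  for (int i = 0; i < N; i++)
    out[i] = fmaxf(sqrtf(in[i]), expf(-in[i]));

  printf("out[N-1] = %f\n", out[N - 1]);
  return 0;
}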
The libomptarget runtime library implements performance profiling that tracks GPU kernel start and completion times and data-transfer times. Set the environment variable LIBOMPTARGET_PLUGIN_PROFILE=T to enable this feature.
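For example (a minimal invocation sketch; the program name is illustrative):
$ export LIBOMPTARGET_PLUGIN_PROFILE=T
$ ./matmul
An example of the results is shown below: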
================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) UHD Graphics 630 [0x3e92], Thread 0
--------------------------------------------------------------------------------
-- Kernel 0                  : __omp_offloading_3a_dca6bdd1_MAIN___l30
--------------------------------------------------------------------------------
-- Name                      : Host Time (msec)   Device Time (msec)
-- Compiling                 : 2632.679           2632.679
-- DataAlloc                 : 11.350             11.350
-- DataRead (Device to Host) : 2.633              2.633
-- DataWrite (Host to Device): 8.398              8.398
-- Kernel 0                  : 5279.800           5258.379
-- OffloadEntriesInit        : 2.731              2.731
-- Total                     : 7937.591           7916.170
================================================================================

Early Integration of GPU-specific Debug Information

To provide GPU-specific debugging, support for the environment variable LIBOMPTARGET_DEBUG was added:
export LIBOMPTARGET_DEBUG=1
Setting this variable dumps offloading runtime debugging information. Its default value is 0, which produces no offloading runtime debug output. The following run of the previous matrix multiplication example shows GPU-specific debug information:
$ icx -v
Intel(R) oneAPI DPC++/C++ Compiler 2021.4.0 …
$ icx -qopenmp -fopenmp-targets=spir64 matmul.cpp -o matmul
$ export LIBOMPTARGET_DEBUG=1
$ ./matmul
Libomptarget --> Init target library!
Libomptarget --> Initialized OMPT
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.level0.so'...
Target LEVEL0 RTL --> Init Level0 plugin!
Target LEVEL0 RTL --> omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --> omp_get_max_teams() returned 0
Libomptarget --> Successfully loaded library 'libomptarget.rtl.level0.so'!
Target LEVEL0 RTL --> Looking for Level0 devices...
Target LEVEL0 RTL --> Initialized L0, API 10002
Target LEVEL0 RTL --> Found 1 driver(s)!
Target LEVEL0 RTL --> Found copy command queue for device 0x0000000000e4bc00, ordinal = 1
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Iris(R) Xe MAX Graphics [0x4905]
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) UHD Graphics 630 [0x3e98]
Target LEVEL0 RTL --> No subdevices are found for device 0x0000000000e4bc00 at level 0
Target LEVEL0 RTL --> Could not find multi-context command queue group for device 0x0000000000e4bc00
Target LEVEL0 RTL --> No subdevices are found for device 0x0000000000ea31e0 at level 0
Target LEVEL0 RTL --> Could not find multi-context command queue group for device 0x0000000000ea31e0
Target LEVEL0 RTL --> Found 2 root devices, 2 total devices.
Target LEVEL0 RTL --> List of devices (DeviceID[.SubDeviceLevel.SubDeviceID])
Target LEVEL0 RTL --> -- 0
Target LEVEL0 RTL --> -- 1
Target LEVEL0 RTL --> Driver API version is 10001
Target LEVEL0 RTL --> Interop property IDs, Names, Descriptions
Target LEVEL0 RTL --> -- 0, device_num_eus, intptr_t, total number of EUs
Target LEVEL0 RTL --> -- 1, device_num_threads_per_eu, intptr_t, number of threads per EU
Target LEVEL0 RTL --> -- 2, device_eu_simd_width, intptr_t, physical EU simd width
Target LEVEL0 RTL --> -- 3, device_num_eus_per_subslice, intptr_t, number of EUs per sub-slice
Target LEVEL0 RTL --> -- 4, device_num_subslices_per_slice, intptr_t, number of sub-slices per slice
Target LEVEL0 RTL --> -- 5, device_num_slices, intptr_t, number of slices
Target LEVEL0 RTL --> Returning 2 top-level devices
Libomptarget --> Registering RTL libomptarget.rtl.level0.so supporting 2 devices!
Libomptarget --> Optional interface: __tgt_rtl_data_alloc_base Libomptarget --> Optional interface: __tgt_rtl_data_alloc_user Libomptarget --> Optional interface: __tgt_rtl_data_alloc_explicit Libomptarget --> Optional interface: __tgt_rtl_data_alloc_managed Libomptarget --> Optional interface: __tgt_rtl_data_submit_nowait Libomptarget --> Optional interface: __tgt_rtl_data_retrieve_nowait Libomptarget --> Optional interface: __tgt_rtl_create_offload_queue Libomptarget --> Optional interface: __tgt_rtl_release_offload_queue Libomptarget --> Optional interface: __tgt_rtl_get_platform_handle Libomptarget --> Optional interface: __tgt_rtl_set_device_handle Libomptarget --> Optional interface: __tgt_rtl_get_context_handle Libomptarget --> Optional interface: __tgt_rtl_init_ompt Libomptarget --> Optional interface: __tgt_rtl_is_device_accessible_ptr Libomptarget --> Optional interface: __tgt_rtl_manifest_data_for_region Libomptarget --> Optional interface: __tgt_rtl_push_subdevice Libomptarget --> Optional interface: __tgt_rtl_pop_subdevice Libomptarget --> Optional interface: __tgt_rtl_add_build_options Libomptarget --> Optional interface: __tgt_rtl_is_supported_device Libomptarget --> Optional interface: __tgt_rtl_deinit Libomptarget --> Optional interface: __tgt_rtl_create_interop Libomptarget --> Optional interface: __tgt_rtl_release_interop Libomptarget --> Optional interface: __tgt_rtl_use_interop Libomptarget --> Optional interface: __tgt_rtl_get_num_interop_properties Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_value Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_info Libomptarget --> Optional interface: __tgt_rtl_get_interop_rc_desc Libomptarget --> Optional interface: __tgt_rtl_get_num_sub_devices Libomptarget --> Optional interface: __tgt_rtl_is_accessible_addr_range Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region Libomptarget --> Optional interface: __tgt_rtl_run_target_region_nowait Libomptarget --> Optional interface: __tgt_rtl_run_target_team_region_nowait Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region_nowait Target LEVEL0 RTL --> Initialized OMPT Libomptarget --> Loading library 'libomptarget.rtl.opencl.so'... Target OPENCL RTL --> Init OpenCL plugin! Target OPENCL RTL --> omp_get_thread_limit() returned 2147483647 Target OPENCL RTL --> omp_get_max_teams() returned 0 Target OPENCL RTL --> Target device type is set to GPU Libomptarget --> Successfully loaded library 'libomptarget.rtl.opencl.so'! Target OPENCL RTL --> Start initializing OpenCL Target OPENCL RTL --> Platform OpenCL 3.0 has 1 Devices Target OPENCL RTL --> Extension clGetMemAllocInfoINTEL is found. Target OPENCL RTL --> Extension clHostMemAllocINTEL is found. Target OPENCL RTL --> Extension clDeviceMemAllocINTEL is found. Target OPENCL RTL --> Extension clSharedMemAllocINTEL is found. Target OPENCL RTL --> Extension clMemFreeINTEL is found. Target OPENCL RTL --> Extension clSetKernelArgMemPointerINTEL is found. Target OPENCL RTL --> Extension clEnqueueMemcpyINTEL is found. Target OPENCL RTL --> Extension clGetDeviceGlobalVariablePointerINTEL is found. Target OPENCL RTL --> Extension clGetKernelSuggestedLocalWorkSizeINTEL is found. Target OPENCL RTL --> Extension clSetProgramSpecializationConstant is found. Target OPENCL RTL --> Platform OpenCL 3.0 has 1 Devices Target OPENCL RTL --> Extension clGetMemAllocInfoINTEL is found. Target OPENCL RTL --> Extension clHostMemAllocINTEL is found. 
Target OPENCL RTL --> Extension clDeviceMemAllocINTEL is found. Target OPENCL RTL --> Extension clSharedMemAllocINTEL is found. Target OPENCL RTL --> Extension clMemFreeINTEL is found. Target OPENCL RTL --> Extension clSetKernelArgMemPointerINTEL is found. Target OPENCL RTL --> Extension clEnqueueMemcpyINTEL is found. Target OPENCL RTL --> Extension clGetDeviceGlobalVariablePointerINTEL is found. Target OPENCL RTL --> Extension clGetKernelSuggestedLocalWorkSizeINTEL is found. Target OPENCL RTL --> Extension clSetProgramSpecializationConstant is found. Target OPENCL RTL --> Device 0: Intel(R) Iris(R) Xe MAX Graphics [0x4905] Target OPENCL RTL --> Number of execution units on the device is 96 Target OPENCL RTL --> Maximum work group size for the device is 512 Target OPENCL RTL --> Maximum memory allocation size is 4044357632 Target OPENCL RTL --> Addressing mode is 64 bit Target OPENCL RTL --> Device local mem size: 65536 Target OPENCL RTL --> Device 1: Intel(R) UHD Graphics 630 [0x3e98] Target OPENCL RTL --> Number of execution units on the device is 24 Target OPENCL RTL --> Maximum work group size for the device is 256 Target OPENCL RTL --> Maximum memory allocation size is 4294959104 Target OPENCL RTL --> Addressing mode is 64 bit Target OPENCL RTL --> Device local mem size: 65536 Libomptarget --> Registering RTL libomptarget.rtl.opencl.so supporting 2 devices! Libomptarget --> Optional interface: __tgt_rtl_data_alloc_base Libomptarget --> Optional interface: __tgt_rtl_data_alloc_user Libomptarget --> Optional interface: __tgt_rtl_data_alloc_explicit Libomptarget --> Optional interface: __tgt_rtl_data_alloc_managed Libomptarget --> Optional interface: __tgt_rtl_data_submit_nowait Libomptarget --> Optional interface: __tgt_rtl_data_retrieve_nowait Libomptarget --> Optional interface: __tgt_rtl_create_offload_queue Libomptarget --> Optional interface: __tgt_rtl_release_offload_queue Libomptarget --> Optional interface: __tgt_rtl_get_device_name Libomptarget --> Optional interface: __tgt_rtl_get_platform_handle Libomptarget --> Optional interface: __tgt_rtl_set_device_handle Libomptarget --> Optional interface: __tgt_rtl_get_context_handle Libomptarget --> Optional interface: __tgt_rtl_get_data_alloc_info Libomptarget --> Optional interface: __tgt_rtl_init_ompt Libomptarget --> Optional interface: __tgt_rtl_is_device_accessible_ptr Libomptarget --> Optional interface: __tgt_rtl_manifest_data_for_region Libomptarget --> Optional interface: __tgt_rtl_add_build_options Libomptarget --> Optional interface: __tgt_rtl_is_supported_device Libomptarget --> Optional interface: __tgt_rtl_deinit Libomptarget --> Optional interface: __tgt_rtl_create_interop Libomptarget --> Optional interface: __tgt_rtl_release_interop Libomptarget --> Optional interface: __tgt_rtl_use_interop Libomptarget --> Optional interface: __tgt_rtl_get_num_interop_properties Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_value Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_info Libomptarget --> Optional interface: __tgt_rtl_get_interop_rc_desc Libomptarget --> Optional interface: __tgt_rtl_is_accessible_addr_range Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region Libomptarget --> Optional interface: __tgt_rtl_run_target_region_nowait Libomptarget --> Optional interface: __tgt_rtl_run_target_team_region_nowait Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region_nowait Target OPENCL RTL --> Initialized OMPT Libomptarget --> Loading library 
'libomptarget.rtl.ppc64.so'... Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory! Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'... Libomptarget --> Unable to load library 'libomptarget.rtl.x86_64.so': libffi.so.6: cannot open shared object file: No such file or directory! Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'... Libomptarget --> Unable to load library 'libomptarget.rtl.cuda.so': libomptarget.rtl.cuda.so: cannot open shared object file: No such file or directory! Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'... Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory! Libomptarget --> Loading library 'libomptarget.rtl.ve.so'... Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory! Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.so'... Libomptarget --> Unable to load library 'libomptarget.rtl.amdgpu.so': libomptarget.rtl.amdgpu.so: cannot open shared object file: No such file or directory! Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'... Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: cannot open shared object file: No such file or directory! Libomptarget --> RTLs loaded! Target LEVEL0 RTL --> Target binary is a valid oneAPI OpenMP image. Libomptarget --> Image 0x00000000004021d0 is compatible with RTL libomptarget.rtl.level0.so! Libomptarget --> RTL 0x0000000000cd6f20 has index 0! Libomptarget --> Registering image 0x00000000004021d0 with RTL libomptarget.rtl.level0.so! Libomptarget --> Done registering entries! Libomptarget --> Entering target region with entry point 0x00000000004020b0 and device Id 0 Libomptarget --> Call to omp_get_num_devices returning 2 Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found) Libomptarget --> Call to omp_get_num_devices returning 2 Libomptarget --> Call to omp_get_num_devices returning 2 Libomptarget --> Call to omp_get_initial_device returning 2 Libomptarget --> Checking whether device 0 is ready. Libomptarget --> Is the device 0 (local ID 0) initialized? 
0 Target LEVEL0 RTL --> Initialize requires flags to 1 Target LEVEL0 RTL --> Allocated a host memory object 0x0000149189791000 Target LEVEL0 RTL --> Initialized host memory pool for device 0x0000000000000000: AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456 Target LEVEL0 RTL --> Allocated a shared memory object 0x0000149189791000 Target LEVEL0 RTL --> Initialized shared memory pool for device 0x0000000000e4bc00: AllocMax = 65536, Capacity = 1, PoolSizeMax = 268435456 Target LEVEL0 RTL --> Allocated a shared memory object 0x0000149189791000 Target LEVEL0 RTL --> Initialized shared memory pool for device 0x0000000000ea31e0: AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456 Target LEVEL0 RTL --> Allocated a device memory object 0x00003b7070bc0000 Target LEVEL0 RTL --> Initialized device memory pool for device 0x0000000000e4bc00: AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456 Target LEVEL0 RTL --> Allocated a device memory object 0xffffd556aa7e0000 Target LEVEL0 RTL --> Initialized device memory pool for device 0x0000000000ea31e0: AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456 Target LEVEL0 RTL --> Initialized Level0 device 0 Libomptarget --> Device 0 is ready to use. Target LEVEL0 RTL --> Device 0: Loading binary from 0x00000000004021d0 Target LEVEL0 RTL --> Expecting to have 1 entries defined Target LEVEL0 RTL --> Base L0 module compilation options: -cl-std=CL2.0 Target LEVEL0 RTL --> Created module from image #0. Target LEVEL0 RTL --> Kernel 0: Entry = 0x00000000004020b0, Name = __omp_offloading_42_d74c0b80__Z7Computev_l25, NumArgs = 5, Handle = 0x000000000186dff0 Target LEVEL0 RTL --> Looking up device global variable '__omp_spirv_program_data' of size 48 bytes on device 0. Target LEVEL0 RTL --> Global variable lookup succeeded. Target LEVEL0 RTL --> Created a command list 0x0000000001c94a50 for device 0. Target LEVEL0 RTL --> Created a command queue 0x00000000016a2260 for device 0. Libomptarget --> Entry 0: Base=0x00000000004651a0, Begin=0x00000000004651a0, Size=65536, Type=0x23, Name=unknown Libomptarget --> Entry 1: Base=0x00000000004451a0, Begin=0x00000000004451a0, Size=65536, Type=0x21, Name=unknown Libomptarget --> Entry 2: Base=0x00000000004551a0, Begin=0x00000000004551a0, Size=65536, Type=0x21, Name=unknown Libomptarget --> Entry 3: Base=0x0000000000000000, Begin=0x0000000000000000, Size=0, Type=0x120, Name=unknown Libomptarget --> Entry 4: Base=0x000000000000007f, Begin=0x000000000000007f, Size=0, Type=0x120, Name=unknown Libomptarget --> Entry 5: Base=0x00007ffc42aa36a0, Begin=0x00007ffc42aa36a0, Size=32, Type=0x800, Name=unknown Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004651a0, Size=65536)... Target LEVEL0 RTL --> Ptr 0x00000000004651a0 is not a device accessible memory pointer. 
Target LEVEL0 RTL --> Allocated a device memory object 0x00003b7070b70000 Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0x00003b7070b70000, size = 262144, pool size = 262144 Target LEVEL0 RTL --> Allocated target memory 0x00003b7070b70000 (Base: 0x00003b7070b70000, Size: 65536) from memory pool for host ptr 0x00000000004651a0 Libomptarget --> Creating new map entry with HstPtrBegin=0x00000000004651a0, TgtPtrBegin=0x00003b7070b70000, Size=65536, RefCount=1, Name=unknown Libomptarget --> There are 65536 bytes allocated at target address 0x00003b7070b70000 - is new Libomptarget --> Moving 65536 bytes (hst:0x00000000004651a0) -> (tgt:0x00003b7070b70000) Target LEVEL0 RTL --> Copy Engine is used for data transfer Target LEVEL0 RTL --> Copied 65536 bytes (hst:0x00000000004651a0) -> (tgt:0x00003b7070b70000) Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004451a0, Size=65536)... Target LEVEL0 RTL --> Ptr 0x00000000004451a0 is not a device accessible memory pointer. Target LEVEL0 RTL --> Allocated target memory 0x00003b7070b80000 (Base: 0x00003b7070b80000, Size: 65536) from memory pool for host ptr 0x00000000004451a0 Libomptarget --> Creating new map entry with HstPtrBegin=0x00000000004451a0, TgtPtrBegin=0x00003b7070b80000, Size=65536, RefCount=1, Name=unknown Libomptarget --> There are 65536 bytes allocated at target address 0x00003b7070b80000 - is new Libomptarget --> Moving 65536 bytes (hst:0x00000000004451a0) -> (tgt:0x00003b7070b80000) Target LEVEL0 RTL --> Copy Engine is used for data transfer Target LEVEL0 RTL --> Copied 65536 bytes (hst:0x00000000004451a0) -> (tgt:0x00003b7070b80000) Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004551a0, Size=65536)... Target LEVEL0 RTL --> Ptr 0x00000000004551a0 is not a device accessible memory pointer. Target LEVEL0 RTL --> Allocated target memory 0x00003b7070b90000 (Base: 0x00003b7070b90000, Size: 65536) from memory pool for host ptr 0x00000000004551a0 Libomptarget --> Creating new map entry with HstPtrBegin=0x00000000004551a0, TgtPtrBegin=0x00003b7070b90000, Size=65536, RefCount=1, Name=unknown Libomptarget --> There are 65536 bytes allocated at target address 0x00003b7070b90000 - is new Libomptarget --> Moving 65536 bytes (hst:0x00000000004551a0) -> (tgt:0x00003b7070b90000) Target LEVEL0 RTL --> Copy Engine is used for data transfer Target LEVEL0 RTL --> Copied 65536 bytes (hst:0x00000000004551a0) -> (tgt:0x00003b7070b90000) Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004651a0, Size=65536)... Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004651a0, TgtPtrBegin=0x00003b7070b70000, Size=65536, RefCount=1 (update suppressed) Libomptarget --> Obtained target argument (Begin: 0x00003b7070b70000, Offset: 0) from host pointer 0x00000000004651a0 Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004451a0, Size=65536)... Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004451a0, TgtPtrBegin=0x00003b7070b80000, Size=65536, RefCount=1 (update suppressed) Libomptarget --> Obtained target argument (Begin: 0x00003b7070b80000, Offset: 0) from host pointer 0x00000000004451a0 Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004551a0, Size=65536)... 
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004551a0, TgtPtrBegin=0x00003b7070b90000, Size=65536, RefCount=1 (update suppressed) Libomptarget --> Obtained target argument (Begin: 0x00003b7070b90000, Offset: 0) from host pointer 0x00000000004551a0 Libomptarget --> Forwarding first-private value 0x0000000000000000 to the target construct Libomptarget --> Forwarding first-private value 0x000000000000007f to the target construct Libomptarget --> Launching target execution __omp_offloading_42_d74c0b80__Z7Computev_l25 with pointer 0x0000000001091720 (index=0). Libomptarget --> Manifesting used target pointers: Target LEVEL0 RTL --> Executing a kernel 0x0000000001091720... Target LEVEL0 RTL --> Kernel argument 0 (value: 0x00003b7070b70000) was set successfully Target LEVEL0 RTL --> Kernel argument 1 (value: 0x00003b7070b80000) was set successfully Target LEVEL0 RTL --> Kernel argument 2 (value: 0x00003b7070b90000) was set successfully Target LEVEL0 RTL --> Kernel argument 3 (value: 0x0000000000000000) was set successfully Target LEVEL0 RTL --> Kernel argument 4 (value: 0x000000000000007f) was set successfully Target LEVEL0 RTL --> Setting indirect access flags 0x0000000000000000 Target LEVEL0 RTL --> Assumed kernel SIMD width is 16 Target LEVEL0 RTL --> Preferred group size is multiple of 32 Target LEVEL0 RTL --> Max group size is set to 128 (thread_limit clause) Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 127, Stride = 1 Target LEVEL0 RTL --> Group sizes = {128, 1, 1} Target LEVEL0 RTL --> Group counts = {1, 1, 1} Target LEVEL0 RTL --> Created a command list 0x0000000001d040c0 for device 0. Target LEVEL0 RTL --> Created a command queue 0x00000000011179e0 for device 0. Target LEVEL0 RTL --> Executed a kernel 0x0000000001091720 Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004551a0, Size=65536)... Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004551a0, TgtPtrBegin=0x00003b7070b90000, Size=65536, RefCount=1 (deferred final decrement) Libomptarget --> There are 65536 bytes allocated at target address 0x00003b7070b90000 - is last Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004451a0, Size=65536)... Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004451a0, TgtPtrBegin=0x00003b7070b80000, Size=65536, RefCount=1 (deferred final decrement) Libomptarget --> There are 65536 bytes allocated at target address 0x00003b7070b80000 - is last Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004651a0, Size=65536)... Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004651a0, TgtPtrBegin=0x00003b7070b70000, Size=65536, RefCount=1 (deferred final decrement) Libomptarget --> There are 65536 bytes allocated at target address 0x00003b7070b70000 - is last Libomptarget --> Moving 65536 bytes (tgt:0x00003b7070b70000) -> (hst:0x00000000004651a0) Target LEVEL0 RTL --> Copy Engine is used for data transfer Target LEVEL0 RTL --> Copied 65536 bytes (tgt:0x00003b7070b70000) -> (hst:0x00000000004651a0) Target LEVEL0 RTL --> Ptr 0x00000000004551a0 is not a device accessible memory pointer. Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004551a0, Size=65536)... Libomptarget --> Deleting tgt data 0x00003b7070b90000 of size 65536 Target LEVEL0 RTL --> Returned device memory 0x00003b7070b90000 to memory pool Libomptarget --> Removing map entry with HstPtrBegin=0x00000000004551a0, TgtPtrBegin=0x00003b7070b90000, Size=65536, Name=unknown Target LEVEL0 RTL --> Ptr 0x00000000004451a0 is not a device accessible memory pointer. 
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004451a0, Size=65536)... Libomptarget --> Deleting tgt data 0x00003b7070b80000 of size 65536 Target LEVEL0 RTL --> Returned device memory 0x00003b7070b80000 to memory pool Libomptarget --> Removing map entry with HstPtrBegin=0x00000000004451a0, TgtPtrBegin=0x00003b7070b80000, Size=65536, Name=unknown Target LEVEL0 RTL --> Ptr 0x00000000004651a0 is not a device accessible memory pointer. Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004651a0, Size=65536)... Libomptarget --> Deleting tgt data 0x00003b7070b70000 of size 65536 Target LEVEL0 RTL --> Returned device memory 0x00003b7070b70000 to memory pool Libomptarget --> Removing map entry with HstPtrBegin=0x00000000004651a0, TgtPtrBegin=0x00003b7070b70000, Size=65536, Name=unknown Target OPENCL RTL --> Closed RTL successfully Libomptarget --> Unloading target library! Target LEVEL0 RTL --> Target binary is a valid oneAPI OpenMP image. Libomptarget --> Image 0x00000000004021d0 is compatible with RTL 0x0000000000cd6f20! Libomptarget --> Unregistered image 0x00000000004021d0 from RTL 0x0000000000cd6f20! Libomptarget --> Done unregistering images! Libomptarget --> Removing translation table for descriptor 0x00000000004021b0 Libomptarget --> Done unregistering library! Libomptarget --> Deinit target library! Target LEVEL0 RTL --> Deinit Level0 plugin! Target LEVEL0 RTL --> Memory usage for host memory, device 0: Target LEVEL0 RTL --> -- Allocator: Native, Pool Target LEVEL0 RTL --> -- Requested: 0, 0 Target LEVEL0 RTL --> -- Allocated: 0, 0 Target LEVEL0 RTL --> -- Freed : 0, 0 Target LEVEL0 RTL --> -- InUse : 0, 0 Target LEVEL0 RTL --> -- PeakUse : 0, 0 Target LEVEL0 RTL --> -- NumAllocs: 0, 0 Target LEVEL0 RTL --> Memory usage for shared memory, device 0: Target LEVEL0 RTL --> -- Allocator: Native, Pool Target LEVEL0 RTL --> -- Requested: 0, 0 Target LEVEL0 RTL --> -- Allocated: 0, 0 Target LEVEL0 RTL --> -- Freed : 0, 0 Target LEVEL0 RTL --> -- InUse : 0, 0 Target LEVEL0 RTL --> -- PeakUse : 0, 0 Target LEVEL0 RTL --> -- NumAllocs: 0, 0 Target LEVEL0 RTL --> Memory usage for shared memory, device 1: Target LEVEL0 RTL --> -- Allocator: Native, Pool Target LEVEL0 RTL --> -- Requested: 0, 0 Target LEVEL0 RTL --> -- Allocated: 0, 0 Target LEVEL0 RTL --> -- Freed : 0, 0 Target LEVEL0 RTL --> -- InUse : 0, 0 Target LEVEL0 RTL --> -- PeakUse : 0, 0 Target LEVEL0 RTL --> -- NumAllocs: 0, 0 Target LEVEL0 RTL --> Memory usage for device memory, device 0: Target LEVEL0 RTL --> -- Allocator: Native, Pool Target LEVEL0 RTL --> -- Requested: 262144, 196608 Target LEVEL0 RTL --> -- Allocated: 262144, 196608 Target LEVEL0 RTL --> -- Freed : 262144, 196608 Target LEVEL0 RTL --> -- InUse : 0, 0 Target LEVEL0 RTL --> -- PeakUse : 262144, 196608 Target LEVEL0 RTL --> -- NumAllocs: 1, 3 Target LEVEL0 RTL --> Memory usage for device memory, device 1: Target LEVEL0 RTL --> -- Allocator: Native, Pool Target LEVEL0 RTL --> -- Requested: 0, 0 Target LEVEL0 RTL --> -- Allocated: 0, 0 Target LEVEL0 RTL --> -- Freed : 0, 0 Target LEVEL0 RTL --> -- InUse : 0, 0 Target LEVEL0 RTL --> -- PeakUse : 0, 0 Target LEVEL0 RTL --> -- NumAllocs: 0, 0 Target LEVEL0 RTL --> Closed RTL successfully Target OPENCL RTL --> Deinit OpenCL plugin! PASSED
Programming an Intel® GPU is similar to programming other GPUs, but different GPU (micro)architectures perform differently, and re-tuning code for a new (micro)architecture is harder than functional migration. Intel is working to provide compilers, libraries, and tools that reduce the burden of re-tuning, but this does not eliminate the need for performance optimization.

Find More

  • OpenMP 5.0 and 5.1 specification PDFs: The OpenMP API 5.0 Specification and the OpenMP API 5.1 Specification describe how OpenMP offloading can be used for devices.
  • The GNU C/C++ Library
  • SC'16 and SC'17 LLVM-HPC workshop papers on OpenMP support: "LLVM Compiler Implementation for Explicit Parallelization and SIMD Vectorization," LLVM-HPC@SC 2017: 4:1–4:11; "LLVM Framework and IR Extensions for Parallelization, SIMD Vectorization and Offloading," LLVM-HPC@SC 2016: 21–31.

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Product and Performance Information

1 Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.