Developer Guide


Tools to Analyze Performance of OpenMP Applications

Several tools and mechanisms are available for analyzing the performance of OpenMP programs and identifying bottlenecks.
Intel® VTune™ Profiler. Intel® VTune™ Profiler can be used to analyze the performance of an application. It helps identify the most time-consuming (hot) functions in the application, whether the application is CPU- or GPU-bound, how effectively it offloads code to the GPU, and the best sections of code to optimize for sequential and threaded performance, among other things. For more information about VTune Profiler, refer to the Intel® VTune™ Profiler User Guide.
Level Zero Tracer. The Level Zero Tracer (ze_tracer) is a host and device tracing tool for the Level Zero backend, with support for DPC++ and OpenMP GPU offload. For information about this tool, see the Level Zero Tracer section of this document.
  • When using ze_tracer with its host-timing and device-timing options, look at the host- and device-side summaries at the end of the trace, under the headings “API Timing Results” and “Device Timing Results”, respectively.
  • Note that only explicit data transfers appear in the trace. Transfers of data allocated in Unified Shared Memory (USM) may not appear in the trace.
  • ze_tracer is useful for confirming that offloading of oneMKL kernels has occurred. The environment variable OMP_TARGET_OFFLOAD=MANDATORY does not affect oneMKL, and therefore cannot be used to guarantee that offloading of oneMKL kernels has occurred. One way to check that oneMKL kernels (and other kernels) have been offloaded is to see which kernels are listed under “Device Timing Results” in the trace generated by ze_tracer.
SYCL_PI_TRACE=2 environment variable. The DPC++ Runtime Plugin Interface (PI) is an interface layer between the device-agnostic part of the DPC++ runtime and the device-specific runtime layers that control execution on devices. Setting SYCL_PI_TRACE=2 provides a trace of all PI calls made, with their arguments and returned values. For more information, see the DPC++ Runtime Plugin Interface documentation.
LIBOMPTARGET_DEBUG=1 environment variable. LIBOMPTARGET_DEBUG controls whether debugging information from the libomptarget runtime library is displayed.
The debugging output provides useful information such as ND-range partitioning of loop iterations, data transfers between host and device, and memory usage, as shown in the Using More GPU Resources and Minimizing Data Transfers and Memory Allocations sections of this document.
For more information about LIBOMPTARGET_DEBUG, see LLVM/OpenMP Runtimes.
LIBOMPTARGET_PROFILE environment variable. LIBOMPTARGET_PROFILE enables the generation of time profile output. For more information, see LLVM/OpenMP Runtimes.
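As a minimal sketch of how the LIBOMPTARGET_DEBUG variable might be used in practice, the shell function below runs an arbitrary command with LIBOMPTARGET_DEBUG=1 set for that command only and saves everything it prints to a log file. The function name run_with_omptarget_debug, the log file name, and the binary ./my_offload_app are hypothetical and chosen only for illustration.

```shell
# Run a command with libomptarget debugging enabled and save everything it prints.
# Usage: run_with_omptarget_debug <logfile> <command> [args...]
# (The function name is hypothetical; it is not part of any Intel or LLVM tool.)
run_with_omptarget_debug() {
    logfile="$1"
    shift
    # The variable is set only for this one command. stdout and stderr are
    # captured together so the debug lines stay interleaved with the
    # program's own output.
    LIBOMPTARGET_DEBUG=1 "$@" > "$logfile" 2>&1
}

# Example (./my_offload_app is a placeholder for your own binary):
# run_with_omptarget_debug omptarget_debug.log ./my_offload_app
```

Setting the variable per command, rather than exporting it, keeps later runs in the same shell free of debug output.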
Dump of compiler-generated assembly for the device. You can dump the compiler-generated assembly by setting the following two environment variables before doing Just-In-Time (JIT) compilation (or before running the program, in the case of Ahead-Of-Time (AOT) compilation).
export IGC_ShaderDumpEnable=1
export IGC_DumpToCustomDir=my_dump_dir
LLVM IR, assembly, and GenISA files will be dumped in the sub-directory named my_dump_dir (or any other name you choose). In this sub-directory, you will find an assembly file for each kernel. The filename indicates the source line number on which the kernel occurs. The header of the file provides information about the SIMD width and compiler options, among other things. Note that the assembly is generated for the GPU in use: ATS assembly on ATS, and PVC assembly on PVC.
In my_dump_dir, you will also find a file that provides information about the GPU, such as EU count, thread count, and slice count.
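To get a quick overview of what was dumped, a small helper along the following lines can count the files in the dump directory by file extension. This is only a sketch: the function name summarize_dump_dir is hypothetical, and it makes no assumption about which extensions the compiler actually uses.

```shell
# Count the files in a dump directory, grouped by file extension.
# Usage: summarize_dump_dir <directory>
# (The function name is hypothetical and used only for illustration.)
summarize_dump_dir() {
    # List regular files that have an extension, strip the directory part,
    # keep only the text after the last dot, then count each extension.
    find "$1" -type f -name '*.*' \
        | sed -e 's|.*/||' -e 's/.*\.//' \
        | sort \
        | uniq -c
}

# Example, after running the program with the two IGC variables set:
# summarize_dump_dir my_dump_dir
```

Each output line is a count followed by an extension, which makes it easy to confirm that a run produced one assembly file per kernel.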
For more information about the Intel® Graphics Compiler and a listing of the available flags (environment variables) that control compilation, see Intel® Graphics Compiler for OpenCL™ Configuration Flags for Linux Release.
For additional information about debugging and profiling, refer to the Debugging and Profiling section of this document.
