oneMKL Verbose Mode: Quick and Easy GPU Library Execution Profiler

Story at a Glance

 

  • oneMKL Verbose Mode enables developers to debug and understand oneMKL runtime behavior of applications.

  • The feature is now available for GPUs.

  • Key benefit: For quick debug and profiling of oneMKL-related issues, you don’t need to install any additional tools.


What is oneMKL Verbose Mode?

The Intel® oneAPI Math Kernel Library (oneMKL) Verbose Mode acts as a lightweight profiler distributed as part of oneMKL. It helps developers understand how their programs use oneMKL at runtime.

Just set the MKL_VERBOSE environment variable to one of the modes shown in Table 1 below (or call the support function mkl_verbose if you are in interactive execution mode). A complete list of verbose mode settings can be found in the oneMKL documentation. Once set, the application prints useful information, including:

  • the oneMKL functions called
  • parameters passed to oneMKL functions
  • execution time of oneMKL functions
  • configuration settings used by function calls: threading type, interfaces, etc.

Key Benefit

oneMKL Verbose provides a great option to debug and understand oneMKL runtime behavior of applications.

Using oneMKL Verbose, we can understand:

  • Where a function was called incorrectly
  • Whether the size of a task is not optimal
  • Why the execution time is longer than expected

Information from oneMKL Verbose helps to speed up analysis of customer submitted issues, reducing time to achieve workarounds and fixes.

Now Also Available on GPU

oneMKL Verbose Mode has been available for Intel® CPUs for BLAS, LAPACK, FFT, and ScaLAPACK functions. Now, we are also bringing this convenient feature to GPUs.

We are introducing oneMKL GPU Verbose mode with support for BLAS, LAPACK and FFT functions. In addition to all the information about function calls, passed parameters and function execution time, it also gives details about the device on which the kernel is executed.

The key advantage of oneMKL Verbose mode on GPU, just like on CPU, is that no additional tools are required to get this debug information from oneMKL calls. Without the need for a dedicated debug tool, oneMKL Verbose offers a wide range of information that helps identify the root cause of runtime problems in many situations.

Simple, Efficient, Comprehensive, Concise

Let us look at the benefits and usage of oneMKL GPU Verbose mode in more detail.

In the example below, we run the oneMKL BLAS GEMM example code included in the oneMKL distribution. It can be found in the directory tree of your default Intel oneAPI installation at:

/opt/intel/oneapi/mkl/latest/examples/examples_dpcpp.tgz

The verbose output shown was created by running oneMKL GEMM function calls on the Intel® Iris® Xe MAX Graphics processor.

To enable oneMKL GPU Verbose, we set the MKL_VERBOSE environment variable to the desired level of reporting with “export MKL_VERBOSE=<mode>”. The available values of <mode> are listed in Table 1:

Mode   CPU application      GPU application
0      to disable Verbose   to disable Verbose
1      to enable Verbose    to enable Verbose without timing
2      to enable Verbose    to enable Verbose with synchronous timing

Table 1. MKL_VERBOSE Reporting Mode Settings

Many additional resources and examples can be found at the oneAPI Samples GitHub Repository.

Figure 1. oneMKL Samples on the oneAPI Samples GitHub

Understanding the Output

Here is the output for MKL_VERBOSE=1:

Running tests on GPU.
Running with half precision real data type:
MKL_VERBOSE oneMKL 2023.0 Product build 20221107 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.60GHz ilp64 tbb_thread
MKL_VERBOSE Detected GPU0 Intel(R)_Xe_LP Backend:Level_Zero VE:96 Stack:1 maxWGsize:512
 
MKL_VERBOSE oneapi::mkl::blas::column_major::gemm[half](0x7ffec86694a0,Transpose,NonTranspose,79,83,91,2,0x7ffec8669410,103,0x7ffec86696c8,105,3,0x7ffec86696a0,106,unset) mode:standard host:nan device:nan GPU0
     [ … ]
MKL_VERBOSE oneapi::mkl::blas::column_major::gemm[float](0x7ffec86694a0,Transpose,NonTranspose,79,83,91,2,0x7ffec8669410,103,0x7ffec86696c8,105,3,0x7ffec86696a0,106,unset) mode:standard host:nan device:nan GPU0

Every line of verbose output starts with the MKL_VERBOSE prefix, so it is easy to find in logs. The oneMKL Verbose output consists of two parts.
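Because of that common prefix, pulling the verbose lines out of a mixed application log is straightforward. The following is a minimal sketch in standard C++, not part of oneMKL; the function name mkl_verbose_lines is hypothetical and chosen for illustration only.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a oneMKL API): collect every line of a log
// stream that begins with the "MKL_VERBOSE" prefix.
std::vector<std::string> mkl_verbose_lines(std::istream& log) {
    const std::string tag = "MKL_VERBOSE";
    std::vector<std::string> out;
    std::string line;
    while (std::getline(log, line)) {
        // compare() on the first tag.size() characters acts as "starts with".
        if (line.compare(0, tag.size(), tag) == 0) {
            out.push_back(line);
        }
    }
    return out;
}
```

The function accepts any std::istream, so it can be fed from a std::ifstream opened on a saved log or from std::cin when the application output is piped into it.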

The first part gives general information about the oneMKL product, the oneMKL configuration in use, and the processor and GPU card details:

MKL_VERBOSE oneMKL 2023.0 Product build 20221107 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.60GHz ilp64 tbb_thread
MKL_VERBOSE Detected GPU0 Intel(R)_Xe_LP Backend:Level_Zero VE:96 Stack:1 maxWGsize:512

From these two lines we can see:

  • What oneMKL product version and build we have (oneMKL 2023.0 Product build 20221107). This is important for understanding which oneMKL version is used in the run and where a potential issue (or fix) may have been introduced.
  • What processor and operating system are used (e.g., 64-bit Intel® Xeon® Processor with Intel® Advanced Vector Extensions 2, Lnx 3.60GHz)
  • What oneMKL interface is used (ilp64)
  • What type of threading is started (tbb_thread)
  • What GPU card was detected on the host (Detected GPU0 Intel_Xe_LP)
  • What type of backend is used in the run (Backend:Level_Zero)
  • How many vector engines the GPU card has (VE:96)
  • Information about available stacks (Stack:1)
  • Information about the maximum workgroup size for this stack (maxWGsize:512)
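The GPU detection line carries its details as key:value tokens, so they can be collected mechanically. Below is a minimal sketch in standard C++ that assumes the whitespace-separated key:value layout seen in the sample output above; parse_device_fields is a hypothetical name, not a oneMKL function, and the layout may differ between oneMKL versions.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a oneMKL API): split a "Detected GPU" line into
// key/value pairs. Tokens without a ':' (such as the device name) are skipped.
std::map<std::string, std::string> parse_device_fields(const std::string& line) {
    std::map<std::string, std::string> fields;
    std::istringstream tokens(line);
    std::string tok;
    while (tokens >> tok) {
        const std::string::size_type colon = tok.find(':');
        if (colon != std::string::npos && colon > 0) {
            fields[tok.substr(0, colon)] = tok.substr(colon + 1);
        }
    }
    return fields;
}
```

For the sample line, the resulting map holds the entries Backend, VE, Stack, and maxWGsize with the values printed in the log.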

The second part tells us more about the called functions:

MKL_VERBOSE oneapi::mkl::blas::column_major::gemm[half](0x7ffec86694a0,Transpose,NonTranspose,79,83,91,2,0x7ffec8669410,103,0x7ffec86696c8,105,3,0x7ffec86696a0,106,unset) mode:standard host:nan device:nan GPU0

We can see:

  • What function is called (oneapi::mkl::blas::column_major::gemm[half])
  • What parameters are sent to the function: (0x7ffec86694a0,Transpose,NonTranspose,79,83,91,2,0x7ffec8669410,103,0x7ffec86696c8,105,3,0x7ffec86696a0,106,unset)

Detailed information about the available parameters for GEMM function calls can be found in the open oneMKL API Specification.
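When many calls are logged, it can help to split each call line into the function name and its individual arguments. The sketch below is one possible way to do this in standard C++, assuming the line layout shown in the sample output; VerboseCall and parse_verbose_call are hypothetical names, not part of oneMKL. It tracks parenthesis depth so that complex scalars such as (2,-0.5) stay together as a single argument.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one logged call (not a oneMKL type).
struct VerboseCall {
    std::string function;           // text between the prefix and the first '('
    std::vector<std::string> args;  // raw text of each top-level argument
};

// Parse "MKL_VERBOSE name(arg,arg,...) ..." into a VerboseCall.
// Returns false for lines that do not look like a function call,
// such as the version banner or the GPU detection line.
bool parse_verbose_call(const std::string& line, VerboseCall& call) {
    const std::string tag = "MKL_VERBOSE ";
    if (line.compare(0, tag.size(), tag) != 0) return false;

    const std::string::size_type open = line.find('(', tag.size());
    if (open == std::string::npos) return false;

    // A call name is a single token; banner lines contain spaces before '('.
    const std::string name = line.substr(tag.size(), open - tag.size());
    if (name.empty() || name.find(' ') != std::string::npos) return false;

    std::vector<std::string> args;
    std::string current;
    int depth = 1;  // we are just inside the opening parenthesis
    for (std::string::size_type i = open + 1; i < line.size() && depth > 0; ++i) {
        const char c = line[i];
        if (c == '(') {
            ++depth;
            current += c;
        } else if (c == ')') {
            --depth;
            if (depth == 0) args.push_back(current);  // end of argument list
            else current += c;                        // end of a nested group
        } else if (c == ',' && depth == 1) {
            args.push_back(current);                  // top-level separator
            current.clear();
        } else {
            current += c;
        }
    }
    if (depth != 0) return false;  // unbalanced parentheses
    if (args.size() == 1 && args[0].empty()) args.clear();  // "name()"

    call.function = name;
    call.args = args;
    return true;
}
```

The arguments are kept as raw strings on purpose: their meaning depends on the function being called, so any interpretation should be checked against the oneMKL API Specification for that function.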

If we set MKL_VERBOSE=2, we also get the execution time on the host CPU (host:663.18ms):

MKL_VERBOSE oneapi::mkl::blas::column_major::gemm[half](0x7ffe33741600,Transpose,NonTranspose,79,83,91,2,0x7ffe33741570,103,0x7ffe33741820,105,3,0x7ffe337417f8,106,unset) mode:standard host:663.18ms device:nan GPU0

Export Execution Time

Additionally, we can export the execution time on the GPU offload device, but this requires a small change to the SYCL* queue properties in the C++ source.

Let us again use the oneMKL BLAS gemm example at ../mkl/examples/dpcpp/sources/gemm.cpp in your oneMKL installation.

Here is how we create a queue:

    sycl::queue main_queue(dev, exception_handler);

If we want to get the execution time from the device, we also need to enable the following queue property:

    sycl::property::queue::enable_profiling

With this property set, the SYCL runtime captures profiling information for command groups submitted to the queue.

The queue creation line then becomes:

    sycl::queue main_queue(dev, exception_handler, sycl::property::queue::enable_profiling());

With SYCL profiling enabled, we get details about execution time on a given device (device:71.76us):

MKL_VERBOSE oneapi::mkl::blas::column_major::gemm[complex<float>](0x7ffd5b690de0,Transpose,NonTranspose,79,83,91,(2,-0.5),0x7ffd5b690d58,103,0x7ffd5b691008,105,(3,-1.5),0x7ffd5b690fe0,106,unset) mode:standard host:327.29ms device:71.76us GPU0 
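To compare host and device times across many calls, the host: and device: fields can be extracted and converted to a common unit. The following sketch in standard C++ is one way to do that; verbose_time_us is a hypothetical helper, not a oneMKL function. Only the ms and us suffixes appear in the sample output above; the s and ns suffixes handled here are assumptions added for robustness.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a oneMKL API): read the "<key>:<number><unit>"
// field from a verbose line and convert it to microseconds.
// Returns false if the field is missing, reported as "nan", or malformed.
bool verbose_time_us(const std::string& line, const std::string& key, double& out_us) {
    const std::string prefix = key + ":";
    std::istringstream tokens(line);
    std::string tok;
    while (tokens >> tok) {
        if (tok.compare(0, prefix.size(), prefix) != 0) continue;

        const std::string value = tok.substr(prefix.size());
        if (value == "nan") return false;  // timing was not collected

        std::size_t used = 0;
        double number = 0.0;
        try {
            number = std::stod(value, &used);  // parses the numeric part
        } catch (const std::exception&) {
            return false;                      // no number after the key
        }

        const std::string unit = value.substr(used);
        double scale = 0.0;
        if (unit == "us") scale = 1.0;
        else if (unit == "ms") scale = 1e3;
        else if (unit == "s") scale = 1e6;     // assumption: not seen in the sample log
        else if (unit == "ns") scale = 1e-3;   // assumption: not seen in the sample log
        else return false;

        out_us = number * scale;
        return true;
    }
    return false;
}
```

Calling it with the key "host" or "device" on the line above yields both timings in microseconds, which makes it easy to see how much of the host-side time is spent outside the kernel itself.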

At this time, device execution time is reported for BLAS and LAPACK functions only. More device-specific reporting may be added in the future.

The beginning of the MKL_VERBOSE output defines GPU0: Detected GPU0 Intel(R)_Xe_LP Backend:Level_Zero VE:96 Stack:1 maxWGsize:512.

Whenever a subsequent function call lists GPU0 as its execution target, that function was executed on the GPU0 device that MKL_VERBOSE detected and printed at the beginning of the application run.

Note: Please keep in mind that enabling MKL_VERBOSE will affect the performance of the application, since additional output-generating function calls are executed and oneMKL executes SYCL kernels synchronously. It is not recommended to have MKL_VERBOSE enabled for performance testing.

Test Drive oneMKL Verbose Mode Today

Take advantage of oneMKL Verbose execution profiling and reporting on your GPU offload code as a first step before resorting to more complex methods of debugging.

Frequently you will find that oneMKL’s verbose output is all you need to identify the root cause of failed execution or less-than-ideal performance. You can then apply the necessary changes and solve the underlying problem without resorting to more time-consuming and advanced debug options.

Get the Software

Access oneMKL Verbose Mode as part of the Intel oneAPI Math Kernel Library standalone, or as part of the Intel® oneAPI Base Toolkit.


About the Author

Anna Olshanskaia is a Math Algorithm Engineer who works on Intel® oneAPI Math Kernel Library (oneMKL), with expertise in threading/memory management, maintaining build/test systems, and library design and architecture.

Previously, Anna worked as an Infrastructure and DevOps engineer for Intel Performance Libraries (oneMKL, oneDAL, oneDNN, Intel® IPP), where she worked on CI/CD processes and tools, supported product releases, and enabled and maintained the products’ build/test infrastructures.