Verbose Mode Supported in Intel® oneMKL

Gennady Fedorov

Introduction

Intel® oneAPI Math Kernel Library (Intel® oneMKL) version 2024.2 supports verbose mode in the domains listed below.

Verbose mode on CPU is enabled in:
- BLAS
- LAPACK
- FFT
- ScaLAPACK, for the following functions only:
  - P?POTRF
  - P?TRTRI
  - PDSYEV{D, R, X}
  - PZHEEV{D, R, X}
- Vector Statistics

Verbose mode on GPU is enabled in:
- BLAS
- LAPACK
- FFT

This feature helps developers better understand how their programs use Intel® oneMKL functions at run time. Verbose mode reports the oneMKL version in use, the instruction set supported by the run-time processor on CPU, the GPU device on which a kernel is executed, the oneMKL functions called and the parameters passed to them, and the time spent in each function call.

For cluster components like ScaLAPACK, all MPI ranks will print MKL_VERBOSE output.

Using Intel® oneMKL Verbose Mode

Intel® oneMKL verbose mode has three settings: disabled (default), enabled with timing, and enabled without timing (available only for functions targeted to GPU).

To change the verbose mode, do one of the following:

Set the environment variable MKL_VERBOSE:

Setting	CPU Targets	GPU Targets
MKL_VERBOSE=0 (default)	verbose disabled	verbose disabled
MKL_VERBOSE=1	verbose enabled (with timing)	verbose enabled without timing
MKL_VERBOSE=2	verbose enabled (with timing)	verbose enabled with synchronous timing

Or call the support function mkl_verbose(int mode):

Call	CPU Targets	GPU Targets
mkl_verbose(0) (default)	verbose disabled	verbose disabled
mkl_verbose(1)	verbose enabled (with timing)	verbose enabled without timing
mkl_verbose(2)	verbose enabled (with timing)	verbose enabled with synchronous timing
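When the program under test is launched from a script, the environment-variable route can be applied to the child process only. The following is a minimal sketch in Python, assuming only the three MKL_VERBOSE values listed above; the helper name verbose_env is hypothetical and not part of oneMKL.

```python
import os

# The three MKL_VERBOSE values described in the tables above.
VALID_MODES = {0, 1, 2}


def verbose_env(mode, base=None):
    """Return a copy of an environment mapping with MKL_VERBOSE set to `mode`.

    `base` defaults to the current process environment; the original mapping
    is never modified.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"MKL_VERBOSE must be 0, 1, or 2, got {mode!r}")
    env = dict(os.environ if base is None else base)
    env["MKL_VERBOSE"] = str(mode)
    return env


# Build an environment for a child process with verbose mode 2.
env = verbose_env(2, base={"PATH": "/usr/bin"})
print(env["MKL_VERBOSE"])
```

The returned mapping can be passed as the env argument of subprocess.run, for example subprocess.run(["./my_app"], env=verbose_env(1)), where ./my_app stands for any binary linked against oneMKL.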

Verbose mode is disabled by default. When it is enabled, every call to a verbose-enabled function ends by printing a verbose log. The log includes version information, the list of available GPU devices, the function name, the values of its arguments, the time taken by the function, and other details. If the function is targeted to GPU and verbose is enabled without timing, the log reports the time as 0.

Example 1: Using Verbose Mode for DGEMM targeted to CPU

The following example calls the BLAS matrix-matrix multiplication function dgemm(). Build the BLAS program to produce the binary, then set MKL_VERBOSE=1 before running it. The program prints the following verbose information:

The version information line:

MKL_VERBOSE oneMKL 2022.0 Product build 20211022 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.60GHz ilp64 intel_thread

This line indicates that the oneMKL version is 2022.0, the processor supports Intel(R) AVX2, the operating system is Linux, the CPU frequency is 2.60GHz, and the program uses the ilp64 interface and the intel_thread threading layer.

Call description line:

MKL_VERBOSE DGEMM(T,T,5,2,4,0x7ffd39eba628,0x18b1b00,4,0x18a0900,2,0x7ffd39eba630,0x18b1c40,5) 23.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4

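This line shows that the program called DGEMM with the following input parameters:

T,T,5,2,4,0x7ffd39eba628,0x18b1b00,4,0x18a0900,2,0x7ffd39eba630,0x18b1c40,5

The call took 23.93ms. Conditional Numerical Reproducibility, controlled by the environment variable MKL_CBWR, is off (CNR:OFF); MKL_DYNAMIC is on (Dyn:1); and the Fast Memory Manager is on (FastMM:1). The thread that printed the line has ID 0 (TID:0), and 4 threads were used in total (NThr:4).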
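The structure of a call description line can be made concrete with a small parser. The following Python sketch is based only on the sample line shown above; the function name parse_call_line is hypothetical, and accepting the units ns and s in addition to the ms and us seen in this article is an assumption.

```python
import re

# Function name, parenthesized arguments, elapsed time with unit, then
# optional KEY:VALUE fields such as CNR:OFF or NThr:4.
CALL_RE = re.compile(
    r"^MKL_VERBOSE\s+(?P<func>\w+)\((?P<args>.*)\)\s+"
    r"(?P<time>[\d.]+)(?P<unit>ns|us|ms|s)\b\s*(?P<rest>.*)$"
)


def parse_call_line(line):
    """Split one call description line into its parts, or return None."""
    m = CALL_RE.match(line.strip())
    if m is None:
        return None  # for example, the version information line
    fields = {}
    for token in m.group("rest").split():
        key, sep, value = token.partition(":")
        if sep:  # keep only KEY:VALUE tokens
            fields[key] = value
    return {
        "function": m.group("func"),
        "args": m.group("args").split(","),
        "time": float(m.group("time")),
        "unit": m.group("unit"),
        "fields": fields,
    }


sample = (
    "MKL_VERBOSE DGEMM(T,T,5,2,4,0x7ffd39eba628,0x18b1b00,4,0x18a0900,2,"
    "0x7ffd39eba630,0x18b1c40,5) 23.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4"
)
call = parse_call_line(sample)
print(call["function"], call["time"], call["unit"], call["fields"])
```

Pointer arguments such as 0x7ffd39eba628 are kept as strings because they are addresses in the traced process and have no meaning outside it.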

Example 2: Using Verbose Mode for 2D complex FFT targeted to CPU

The following example calls FFT functions. Build the FFT program to produce the binary, then set MKL_VERBOSE=1 before running it. The program prints the following verbose information:

Version information line:

MKL_VERBOSE oneMKL 2022.0 Product build 20211022 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.60GHz intel_thread

Call Description Lines on CPU:

MKL_VERBOSE FFT(scfi13x7,tLim:1,desc:0x2250500) 39.42ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4

FFT(scfi13x7,tLim:1,desc:0x2250500) is a functional dump of the oneMKL FFT descriptor. The content is interpreted as follows:

scfi13x7 is the problem description:
- s/d for Single/Double precision
- c/r for Complex/Real forward domain
- f/b for Forward/Backward compute direction
- i/o for in-place/out-of-place output memory placement
- 13x7 for the dimension lengths, listed from the biggest^ dimension to the smallest^ dimension, with "x" as the delimiter between dimensions
^ The smallest dimension is the one whose transform points are located most densely in memory.
tLim:1 is the DFTI_THREAD_LIMIT setting, the number of threads to be used at run time (if available) to compute the FFT problem. The value set by the user is used and printed; if none is set, oneMKL adjusts it to a value that achieves the best performance on the given system.
desc:0x2250500 is the address of the descriptor handle in memory.

39.42ms is the run time.
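The letter codes in the problem description can be decoded mechanically. The following Python sketch handles only the simple form described above (four letters followed by "x"-separated lengths); the function name decode_fft_problem is hypothetical, and descriptions that carry batch or stride information are deliberately not handled and return None.

```python
import re

PRECISION = {"s": "single", "d": "double"}
DOMAIN = {"c": "complex", "r": "real"}
DIRECTION = {"f": "forward", "b": "backward"}
PLACEMENT = {"i": "in-place", "o": "out-of-place"}

# Four one-letter codes followed by lengths such as 13x7.
SIMPLE_RE = re.compile(r"^([sd])([cr])([fb])([io])(\d+(?:x\d+)*)$")


def decode_fft_problem(problem):
    """Decode a simple FFT problem description such as 'scfi13x7'."""
    m = SIMPLE_RE.match(problem)
    if m is None:
        return None  # batch, stride, or other extended forms
    precision, domain, direction, placement, dims = m.groups()
    return {
        "precision": PRECISION[precision],
        "forward_domain": DOMAIN[domain],
        "direction": DIRECTION[direction],
        "placement": PLACEMENT[placement],
        # Lengths are listed from the biggest dimension to the smallest.
        "lengths": [int(n) for n in dims.split("x")],
    }


print(decode_fft_problem("scfi13x7"))
```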

The verbose output may also contain the following items.

Problem description:

*16 for the DFTI_NUMBER_OF_TRANSFORMS setting (the batch setting) when the input distance between two transforms equals the product of all dimension lengths, the so-called standard memory layout; "*" separates the problem from the batch and distance settings.

v512 for the DFTI_NUMBER_OF_TRANSFORMS setting (the batch setting) when the input distance between two transforms equals 1, the so-called compact memory layout; "v" separates the problem from the batch and distance settings.

7:30:30x13:1:1 for non-standard strides. If the strides differ from the standard ones (in this particular case, a stride of 13 is considered standard, not 30), the full problem is dumped as <length>:<inputStride>:<outputStride> for each dimension, from the biggest dimension to the smallest. If there is also a batch setting, the batch size and the input and output distances appear at the end of the problem description.

fScale:x / bScale:x are the DFTI_FORWARD_SCALE/DFTI_BACKWARD_SCALE settings and reflect the values provided by the user. The default value of 1.0 for each setting is not printed.

pack: perm is a DFTI_PACKED_FORMAT setting. The default value, the CCE format, is not printed.

input: unaligned reports the result of checking whether the input data is aligned on a 64-byte boundary. If the data is aligned, nothing is printed. The same applies to output memory. The out-of-place split complex case (when DFTI_COMPLEX_STORAGE = DFTI_REAL_REAL) is not supported.

Other parameters that can be set through the Intel(R) MKL FFT API are printed if non-default values were used.

Example 3: Using Verbose Mode for 2D complex FFT targeted to GPU

The following example calls FFT functions. Build the FFT program to produce the binary, then set MKL_VERBOSE=2 before running it. The program prints the following verbose information:

Version information line:

MKL_VERBOSE oneMKL 2022.0 Product build 20211022 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.60GHz ilp64 sequential

GPU Information Line:

MKL_VERBOSE Detected GPU0 Intel(R)_Gen9 Backend:Level_Zero VE:72 Stack:1 maxWGsize:256

This line shows that an Intel(R) Gen9 GPU was detected. It uses the Level Zero backend, has 72 vector engines and 1 stack, and has a maximum workgroup size of 256. It is referred to as “GPU0” in the subsequent call description lines.
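The fields of the GPU information line can be extracted in the same spirit. The following Python sketch assumes the layout of the sample line above (device label, device name, then KEY:VALUE tokens); the function name parse_gpu_line is hypothetical, and other oneMKL versions may print additional or different fields.

```python
def parse_gpu_line(line):
    """Parse an 'MKL_VERBOSE Detected ...' line into a dictionary."""
    tokens = line.split()
    if len(tokens) < 4 or tokens[:2] != ["MKL_VERBOSE", "Detected"]:
        return None
    # tokens[2] is the label used in later call description lines.
    info = {"id": tokens[2], "name": tokens[3]}
    for token in tokens[4:]:
        key, sep, value = token.partition(":")
        if sep:  # keep only KEY:VALUE tokens
            info[key] = value
    return info


sample = (
    "MKL_VERBOSE Detected GPU0 Intel(R)_Gen9 Backend:Level_Zero "
    "VE:72 Stack:1 maxWGsize:256"
)
print(parse_gpu_line(sample))
```

Keeping the device label (here GPU0) as the key makes it straightforward to match each call description line to the device that executed it.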

Call Description Lines on GPU:

MKL_VERBOSE FFT(scfi13x6,bScale:0.0128205) 123.47ms GPU0

MKL_VERBOSE FFT(scbi13x6,bScale:0.0128205) 178.13ms GPU0

MKL_VERBOSE FFT(scfi13x6,bScale:0.0128205) 270.44us GPU0

MKL_VERBOSE FFT(scbi13x6,bScale:0.0128205) 188.50us GPU0

These lines show that the FFT functions called are targeted to GPU0.

MKL Verbose Toolkit

An Argonne National Laboratory researcher has written a parsing tool that summarizes MKL_VERBOSE output. It can be useful for users who need a summary of many oneMKL calls and their statistics. The tool is available on GitHub at https://github.com/TApplencourt/mkl-verbose-toolkit
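To illustrate the kind of summary such a tool produces, the following Python sketch totals the call count and elapsed time per function. It is not the toolkit's implementation; the function name summarize is hypothetical, and only the ms and us units that appear in this article are handled (lines with any other unit are skipped).

```python
import re
from collections import defaultdict

# Function name, arguments, and elapsed time in us or ms.
LINE_RE = re.compile(r"^MKL_VERBOSE\s+(\w+)\(.*\)\s+([\d.]+)(us|ms)\b")
TO_MS = {"us": 1e-3, "ms": 1.0}


def summarize(lines):
    """Return {function: {"calls": n, "total_ms": t}} for call description lines."""
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # version line, GPU information line, or unknown unit
        func, value, unit = m.groups()
        totals[func]["calls"] += 1
        totals[func]["total_ms"] += float(value) * TO_MS[unit]
    return dict(totals)


log = [
    "MKL_VERBOSE Detected GPU0 Intel(R)_Gen9 Backend:Level_Zero VE:72 Stack:1 maxWGsize:256",
    "MKL_VERBOSE FFT(scfi13x6,bScale:0.0128205) 123.47ms GPU0",
    "MKL_VERBOSE FFT(scbi13x6,bScale:0.0128205) 178.13ms GPU0",
    "MKL_VERBOSE FFT(scfi13x6,bScale:0.0128205) 270.44us GPU0",
    "MKL_VERBOSE FFT(scbi13x6,bScale:0.0128205) 188.50us GPU0",
]
for func, stats in summarize(log).items():
    print(f"{func}: {stats['calls']} calls, {stats['total_ms']:.3f} ms total")
```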

Some Limitations:

Because every call to a verbose-enabled function requires an output operation, application performance may degrade when verbose mode is enabled.
In addition, oneMKL verbose mode has the following limitations:

- On GPU, if verbose is enabled with timing, kernels are executed synchronously (an earlier kernel blocks later kernels).
- The call description lines may be printed out of order (that is, the order of the lines in the verbose output may differ from the order in which the kernels were submitted) in the following two cases:
  - if verbose is enabled without timing and the kernel executions stay asynchronous
  - if a kernel is executed on the host CPU rather than on one of the GPU devices
- Input values of parameters passed by reference are not printed if the values were changed by the function. For example, if a LAPACK function is called with a workspace query, that is, the value of the lwork parameter equals -1 on input, the call description line prints the result of the query and not -1.
- Return values of functions are not printed. For example, the value returned by the function ilaenv is not printed.
- Floating-point scalars passed by reference are not printed.

Please see the oneMKL Developer Guide for more details about the verbose mode of oneMKL.
