How to Use Intel® oneAPI Math Kernel Library - Part One

Introduction

This article shows how to use Intel® oneAPI Math Kernel Library (oneMKL) BLAS functions in Data Parallel C++ (DPC++). It also shows how to offload the computation to devices such as CPUs and Intel® GPUs.

This article is the first in the oneMKL series. Future articles will show how to use oneMKL functions from LAPACK and from other domains such as Vector Math, Fast Fourier Transforms (FFT), and Random Number Generators (RNG), as well as how to offload computation to GPUs from languages other than DPC++, such as C.

Before continuing, let's briefly look at what DPC++ and oneMKL are.

What is DPC++?

Data Parallel C++ (DPC++) is the oneAPI implementation of SYCL*. It is an open, cross-architecture language designed for data parallel programming and heterogeneous computing.

What is oneMKL?

Intel® oneAPI Math Kernel Library (oneMKL) is a high-performance math library that contains highly optimized, threaded, and vectorized routines for scientific, engineering, and financial applications.
It provides key functionality for dense and sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics, splines, and more.
In addition, oneMKL automatically takes advantage of special hardware features such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) to improve application performance.
More information about oneMKL can be found at the link in the reference section.

To use oneMKL, the following toolkit needs to be downloaded and installed:
Intel® oneAPI Base Toolkit.
https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

At the time of this writing, the latest version of the Intel® oneAPI Base Toolkit is 2022.2. It contains version 2022.1 of the DPC++ compiler and oneMKL.

The following section will show how to check if a system supports GPUs.

How to Check if a System Supports an Intel® GPU

Use the sycl-ls command to see if the system has a supported GPU. At the command prompt, type:

sycl-ls

You will see something similar to this:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.13.3.0.16_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz 3.0 [2022.13.3.0.16_160000]
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Iris(R) Xe Graphics 3.0 [30.0.101.1994]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x9a49] 1.3 [1.3.0]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]

If you see an opencl:gpu… or ext_oneapi_level_zero:gpu… entry, the system has a supported GPU.

For the list of supported GPUs, please see the oneMKL system requirements at:

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html

The next section will show how to offload computation to GPUs.

How to Use oneMKL Functions in a Program

The example below shows how to perform a single-precision (float) matrix multiplication:

C = alpha * op(A) * op(B) + beta * C

Where:

A: matrix of dimension m x k
B: matrix of dimension k x n
C: matrix of dimension m x n
alpha, beta: scalar constants

Note that op() denotes an optional transformation applied independently to A and B; it can be one of three types: nontrans, trans, or conjtrans.
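
These options correspond to the values of the oneapi::mkl::transpose enumeration used in step 3 below:

oneapi::mkl::transpose::nontrans    // op(X) = X
oneapi::mkl::transpose::trans       // op(X) = X^T (transpose)
oneapi::mkl::transpose::conjtrans   // op(X) = X^H (conjugate transpose; equivalent to trans for real data)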

1)    Include appropriate header files

#include <CL/sycl.hpp>
#include "oneapi/mkl/blas.hpp"
#include "mkl.h"

2)    Initialize data 

// Matrix data sizes
int m = …;
int n = …;
int k = …;
// Leading dimensions of data
int lda = k;
int ldb = n;
int ldc = n;

float alpha = …;
float beta = …;

// Initialize Matrices A, B and C...

Note that A, B, and C will be stored as flat vectors (std::vector<float>).
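
As an illustration only, this step might look like the following minimal sketch (the sizes and fill values here are arbitrary assumptions, not part of the original example). Note that lda = k, ldb = n, and ldc = n correspond to row-major storage; oneMKL also provides explicit oneapi::mkl::blas::row_major and oneapi::mkl::blas::column_major namespaces if you need to state the layout explicitly.

// Requires #include <vector>.
// Example sizes (arbitrary values for illustration).
int m = 64, n = 64, k = 64;

// Leading dimensions for row-major storage.
int lda = k, ldb = n, ldc = n;

float alpha = 1.0f;
float beta  = 0.0f;

// Matrices stored as flat vectors in row-major order.
std::vector<float> A(m * k, 1.0f);   // fill A with 1s
std::vector<float> B(k * n, 2.0f);   // fill B with 2s
std::vector<float> C(m * n, 0.0f);   // C starts as zeros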

3)    Specify whether the matrices are transposed

In this case, neither matrix A nor matrix B is transposed:

oneapi::mkl::transpose transA = oneapi::mkl::transpose::nontrans;
oneapi::mkl::transpose transB = oneapi::mkl::transpose::nontrans;

4)    Add Exception handler 

The following exception handler will catch asynchronous exceptions; it is attached to the queue in step 6:

auto exception_handler = [] (cl::sycl::exception_list exceptions) {
	for (std::exception_ptr const& e : exceptions) {
		try {
			std::rethrow_exception(e);
		} catch(cl::sycl::exception const& e) {
			std::cout << "Caught asynchronous SYCL exception during GEMM:\n"
			<< e.what() << std::endl;
		}
	}
};

5)    Select devices to offload computation

cl::sycl::device dev;

If the desired device is a GPU, use:

dev = cl::sycl::device(cl::sycl::gpu_selector());

If the desired device is a CPU, use:

dev = cl::sycl::device(cl::sycl::cpu_selector());
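
If the program should prefer a GPU but fall back to the CPU when none is present, one possible sketch (an addition to the original example) is:

cl::sycl::device dev;
try {
	// Prefer a GPU if one is available.
	dev = cl::sycl::device(cl::sycl::gpu_selector());
} catch (cl::sycl::exception const& e) {
	// No GPU found; fall back to the CPU.
	dev = cl::sycl::device(cl::sycl::cpu_selector());
}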

6)    Create queue and buffers

// Create execution queue.
cl::sycl::queue q(dev, exception_handler);

// Create buffers to do the calculation.
cl::sycl::buffer<float, 1> bufA(A.data(), A.size());
cl::sycl::buffer<float, 1> bufB(B.data(), B.size());
cl::sycl::buffer<float, 1> bufC(C.data(), C.size());

Note that the member function data() returns a pointer to the first element of the matrix, while size() returns the number of elements in the matrix.

7)    Call oneMKL functions

try {
	oneapi::mkl::blas::gemm(q, transA, transB, m, n, k, alpha, bufA, lda, bufB, ldb, beta, bufC, ldc);
}
catch(cl::sycl::exception const& e) {
	std::cout << "\t\tCaught synchronous SYCL exception during GEMM:\n"
		<< e.what() << std::endl;
}
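
The gemm call runs asynchronously and leaves the result in bufC. One way to read it back into the host vector C (a sketch, assuming the C from step 2) is through a host accessor, which waits for the computation to complete:

// Block until the GEMM has finished, then read the result on the host.
auto resultC = bufC.get_access<cl::sycl::access::mode::read>();
for (size_t i = 0; i < C.size(); ++i)
	C[i] = resultC[i];

Alternatively, because bufC was constructed over C.data(), the result is copied back into C automatically when bufC goes out of scope.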

How to build and link to oneMKL

The build instructions in this article are for Linux. It is highly recommended that the Intel® oneAPI Math Kernel Library Link Line Advisor be used to generate the build and link options for the program.
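
Before building, set up the oneAPI environment so that variables such as MKLROOT and TBBROOT are defined. On a typical Linux installation (the install path may differ on your system):

source /opt/intel/oneapi/setvars.sh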

The following will show how to build a program and dynamically link to oneMKL:

dpcpp -DMKL_ILP64 -qmkl=sequential <program.cpp> -L${MKLROOT}/lib/intel64 -lsycl -lOpenCL -lpthread -lm -ldl

Note that <program.cpp> needs to be replaced with the actual program source file.

The above link line links to the single-threaded (sequential) version of oneMKL. To link to the multi-threaded version of oneMKL, use the following link line:

dpcpp -DMKL_ILP64 -qmkl=parallel <program.cpp> -L${MKLROOT}/lib/intel64 -lsycl -lOpenCL -lpthread -lm -ldl

To statically link to oneMKL, use the following link line:

dpcpp -DMKL_ILP64 -I"${MKLROOT}/include" <program.cpp> -fsycl-device-code-split=per_kernel ${MKLROOT}/lib/intel64/libmkl_sycl.a -Wl,-export-dynamic -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -L${TBBROOT}/lib/intel64/gcc4.8 -ltbb -lsycl -lOpenCL -lpthread -lm -ldl
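
Putting the steps together, here is a minimal end-to-end sketch (the sizes and fill values are arbitrary assumptions; the asynchronous exception handler is omitted for brevity):

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include "oneapi/mkl/blas.hpp"

int main() {
    // Arbitrary example sizes; square matrices keep the layout simple.
    int m = 4, n = 4, k = 4;
    int lda = k, ldb = n, ldc = n;
    float alpha = 1.0f, beta = 0.0f;

    std::vector<float> A(m * k, 1.0f);
    std::vector<float> B(k * n, 2.0f);
    std::vector<float> C(m * n, 0.0f);

    auto transA = oneapi::mkl::transpose::nontrans;
    auto transB = oneapi::mkl::transpose::nontrans;

    // The default selector typically prefers a GPU if one is present.
    cl::sycl::queue q{cl::sycl::default_selector{}};

    try {
        // Buffers live in this scope; on destruction they copy results back.
        cl::sycl::buffer<float, 1> bufA(A.data(), A.size());
        cl::sycl::buffer<float, 1> bufB(B.data(), B.size());
        cl::sycl::buffer<float, 1> bufC(C.data(), C.size());

        oneapi::mkl::blas::gemm(q, transA, transB, m, n, k,
                                alpha, bufA, lda, bufB, ldb, beta, bufC, ldc);
    } catch (cl::sycl::exception const& e) {
        std::cout << "Caught synchronous SYCL exception during GEMM:\n"
                  << e.what() << std::endl;
        return 1;
    }

    // Each element of C should be k * 1.0 * 2.0 = 8.
    std::cout << "C[0] = " << C[0] << std::endl;
    return 0;
}

Compile this with one of the dpcpp command lines shown above and run the resulting binary.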

Conclusion

Using DPC++ allows users to develop a single version of the code that can run on either a CPU or a GPU simply by selecting the target device, which simplifies maintenance. In general, using oneMKL functions not only reduces the development time of an application but also improves its performance. In addition, oneMKL automatically takes advantage of special hardware features such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512), so users don't have to enable such features manually and can concentrate on the functionality of their applications.

References

oneAPI link
https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html#gs.343v5t
oneMKL link
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html