How to use the Intel® oneAPI Math Kernel Library (oneMKL) API for CPU, GPU, and Other Accelerations

Introduction 

Programming across diverse architectures (CPU, GPU, FPGA and other accelerators) is crucial in supporting many current workloads where multiple architectures are required and will become the norm. Today, each hardware platform typically requires developers to maintain separate code bases that need to be programmed using different languages, libraries and software tools. This is complex and time consuming and slows acceleration and innovation.

Intel® oneAPI Math Kernel Library (oneMKL) [1] addresses this challenge by delivering a unified application programming interfaces (API) for the list of the most widely used math routines. The list of currently supported routines are documented in the Intel® oneAPI Math Kernel Library Data Parallel C++ Developer Reference [2]. oneMKL version 2021 introduces the following:

  • Intel® oneAPI Data Parallel C++ APIs to maximize application performance and cross-architecture portability
  • C/C++/Fortran OpenMP offload feature support for Intel® GPU acceleration
  • Continues support for All Intel® MKL functionality, including industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, FFTW, and many others.

This article shows the typical steps you can perform for migration from the Intel® MKL routines to the oneMKL routines. The article describes these steps on the examples from the most frequently used matrix-matrix multiplication routine (DGEMM).

This article does not describe all the differences between the standard and DPC++ oneMKL APIs,however, you can find these details in the oneMKL Introduction [3].

oneMKL Data Parallel C++ API

The typical steps for C and DPC++ API of DGEMM routine steps are as following

DPC++ API with SYCL Buffers

C API

#include “oneapi/mkl.hpp”

// Set up device and queue

sycl::device dev(sycl::{host, cpu, gpu}_selector());

sycl::queue Q(dev);

 

//Declaration

int64_t m = 16, n = 20, k = 24;

int64_t lda = m, ldb = n, ldc = m

 

// Allocate A, B, C

double *A_host = ..., *B_host = ..., *C_host = ...;

 

// Prepare the SYCL buffers

sycl::buffer<double, 1>

            A (A_host, sycl::range<1>(lda * k)),

            B (B_host, sycl::range<1>(ldb * n)),

            C (C_host, sycl::range<1>(ldc * n));

 

// Compute C = A * B

oneapi::mkl::blas::gemm(Q,

                mkl::transpose::N, mkl::transpose::N,

                m, n, k, alpha, A, lda, B, ldb,

                beta, C, ldc);

#include “mkl.h”

 

 

 

 

 

//Declaration

int m = 16, n = 20, k = 24;

int lda = m, ldb = n, ldc = m;

 

// Allocate A, B, C

double *A = ..., *B = ..., *C = ...;

 

 

 

 

 

 

// Compute C = A * B

cblas_dgemm(CblasColMajor,

                CblasNoTrans, CblasNoTrans,

                m, n, k, alpha, A, lda, B, ldb,

                beta, C, ldc);

 

 

In the case where Unified Shared Memory (USM) pointer interfaces are used, the calling pipeline would look like the example with SYCL Buffers; however, instead of SYCL Buffers, you must allocate USM arrays as follows:

     auto A = (double *) sycl::malloc_shared(lda * k * sizeof(double), dev, Q.get_context());

    auto B = (double *) sycl::malloc_shared(ldb * n * sizeof(double), dev, Q.get_context());

    auto C = (double *) sycl::malloc_shared(ldc * n * sizeof(double), dev, Q.get_context());

Here is another example which shows the C and DPC++ API of the Vector Mathematical (VM) sin() routine calls. Notice the differences between how the USM pointers are used when compared to the SYCL buffers in the DGEMM example:

DPC++ API with USM

C API

#include “oneapi/mkl.hpp”

// Set up device, exception handler and queue

sycl::device dev {gpu_selector()};

sycl::queue Q {dev};

 

// Allocate A, R in shared memory

float* A = sycl::malloc_shared(sizeof(float)*N,

                                          dev, Q.get_context());

float* R = sycl::malloc_shared(sizeof(float)*N,

                                          dev, Q.get_context());

 

// Compute VM sin() on GPU

oneapi::mkl::vm::sin(Q,

                       N, A, R,

                       nullptr, oneapi::mkl::vm::mode::la);

Q.wait();

#include “mkl.h”

 

 

 

 

 

// Allocate A, R on host

 float A[N] = {0}, R[N];

 

 

 

 

// Calculate VM sin() on host

vmsSin (N, A, R, VML_LA);

 

 

You will find many DPC++ BLAS, LAPACK, FFT, RNG and Sparse BLAS API examples which contain all the requirement details in the oneMKL SYCL and OpenMP Offload Examples.

oneMKL Data Parallel C++ Exception Handling

oneMKL DPC++ API introduces an exception handling mechanism. Exception classification in oneMKL is aligned with the C++ Standard Library classification. oneMKL introduces a class that defines the base class in the hierarchy of oneMKL exception classes. All oneMKL routines throw exceptions inherited from this base class. In the hierarchy of oneMKL exceptions, oneapi::mkl::exception is the base class inherited from the std::exception class. All other oneMKL exception classes are derived from this base class [4].

The DPC++ language also provides error handling for asynchronous exceptions that could be supported in the following manner:


auto exception_handler = [](sycl::exception_list exceptions) {

for (std::exception_ptr const& e : exceptions) {

try {

std::rethrow_exception(e);

}

catch (sycl::exception const& e) {

std::cout << "Caught asynchronous SYCL exception:\n"

<< e.what() << std::endl;

}

}

};

// Choose device to run on and create queue

sycl::gpu_selector selector;

sycl::queue queue(selector, exception_handler);

The exception handler is passed in the sycl::queue constructor.

The main DPC++ computational part can be wrapped into the try-catch block to handle DPC++ runtime and oneMKL library exceptions.


try {
//DPC++ computational part
}

catch (oneapi::mkl::exception const& e) {
std::cout << "\t oneMKL exception \n" << e.what() << std::endl;
}

catch (sycl::exception const& e) {
std::cout << "\t SYCL exception \n" << e.what() << std::endl;
}

OpenMP Offload Feature

For developers who prefer C/Fortran, oneMKL introduces OpenMP Offload feature support for these APIs which allows you to offload oneMKL C/Fortran function execution to the GPU.

This feature supports all the widely used mathematical routines from oneMKL including all of  BLAS, LAPACK, and FFT.

oneMKL CBLAS_SGEMM OpenMP offload example:

 

#include "mkl.h"
#include "mkl_omp_offload.h"
…

//allocation and then initialization all working arrays. e.x. A:
float* A     = (float *) mkl_malloc ((lda * k) * sizeof(float), 64);
....

//initializations all working arras, e.x. – array C:
#pragma omp target map(C[0:sizeC])
{
   Initialize C
}

// run sgemm on Host, use standard oneMKL interface: 
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c_ref, ldc);

#pragma omp target data map(to:A[0:sizea],B[0:sizeb]) map(tofrom:C[0:sizec]) device(dnum)

{
// run sgemm on GPU, use standard oneMKL interface within a variant dispatch construct
#pragma omp target variant dispatch device(dnum) use_device_ptr(A,B,C)
{
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

}
}

//deallocation
mkl_free(A);
...

 

See the C OpenMP Offload Developer Guide for more details about this feature [6].

The same calling pipeline for the Fortran API:

oneMKL FORTRAN OpenMP offload examples for SGEMM

include "mkl_omp_offload.f90"

…..

! allocation and then initialization all working arrays. E.x. A:

allocate(A(lda,k))



! Calling SGEMM on the CPU

call sgemm (transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)



! Calling sgemm on the GPU

!$omp target data map(A,B,C)

!$omp target variant dispatch use_device_ptr(A,B,C)



call sgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)



!$omp end target variant dispatch

!$omp end target data



!deallocation

deallocate(A)

 …

Building and Linking with oneMKL

To configure the link command according to your specific program, use the Intel® MKL Link Line Advisor [5], or as an example, refer to Get Started with Intel® oneAPI Math Kernel Library [7].

 

Here are some typical building and linking examples on Linux OS*.

DPC++ interfaces with static linking on Linux:

dpcpp -fsycl-device-code-split=per_kernel -DMKL_ILP64 <typical user includes and linking flags and other libs> ${MKLROOT}/lib/intel64/libmkl_sycl.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_<sequential|tbb_thread>.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -ldl -lm

 

DPC++ interfaces with dynamic linking on Linux:

dpcpp -DMKL_ILP64 <typical user includes and linking flags and other libs> -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_<sequential|tbb_thread> -lmkl_core -lsycl -lOpenCL -lpthread -ldl -lm

 

C/Fortran Interfaces with OpenMP Offload Support:

icx -fiopenmp -fopenmp-targets=spir64 -fsycl  -DMKL_ILP64 -m64 -I$(MKLROOT)/include main.cpp L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl

 

Authors:

Kraynyuk, Maria maria.kraynyuk@intel.com

Fedorov, Gennady Gennady.Fedorov@intel.com

 

References:

[1] Intel® oneAPI Math Kernel Library

[2] oneAPI Math Kernel Library - Data Parallel C++ Developer Reference

[3] oneMKL Introduction

[4] oneMKL Specification

[5] Intel® MKL Link Line Advisor

[6] C OpenMP Offload Developer Guide

[7] Get Started with Intel® oneAPI Math Kernel Library

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.