# Executing Intel® oneAPI Math Kernel Library Routines on GPU Using OpenMP* Offload

Last Updated: 12/04/2020

### Introduction

The Intel® oneAPI Math Kernel Library (oneMKL) is a set of optimized math functions for speeding up compute-intensive applications in science, engineering, and finance. oneMKL can be used to optimize code for current and future generations of Intel® CPUs, GPUs, and other accelerators. This article shows how to offload oneMKL computations to Intel GPUs using OpenMP* offload in C and Fortran.

### Requirements

The oneMKL documentation lists the hardware and software requirements for running oneMKL. To get oneMKL, download and install the Intel® oneAPI Base Toolkit. To use OpenMP* offload, the Intel® oneAPI HPC Toolkit is also needed.

### Examples

This section shows how to use the oneMKL matrix multiply routine with OpenMP* offload from C and Fortran.

#### C Implementation

```c
#include <stdio.h>
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"

int dnum = 0;

int main() {
    double *a, *b, *c, alpha, beta;
    MKL_INT m, n, k, lda, ldb, ldc, i;
    MKL_INT sizea, sizeb, sizec;

    alpha = 1.0;
    beta = 1.0;

    m = 1000;
    n = 1000;
    k = 1000;

    lda = m;
    ldb = k;
    ldc = m;

    sizea = lda * k;
    sizeb = ldb * n;
    sizec = ldc * n;

    // allocate matrices
    a = (double *)mkl_malloc(sizea * sizeof(double), 64);
    b = (double *)mkl_malloc(sizeb * sizeof(double), 64);
    c = (double *)mkl_malloc(sizec * sizeof(double), 64);

    if ((a == NULL) || (b == NULL) || (c == NULL)) {
        printf("Cannot allocate matrices\n");
        return 1;
    }

    // initialize matrices
    for (i = 0; i < sizea; i++) {
        a[i] = 1.1;
    }

    for (i = 0; i < sizeb; i++) {
        b[i] = 2.2;
    }

    for (i = 0; i < sizec; i++) {
        c[i] = 2.0;
    }

    // map the matrices to the device and run dgemm there
#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum)
    {
#pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
        {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
        }
    }

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);

    return 0;
}
```


Explanation:

```c
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"
```

These include files are needed for oneMKL routines and OpenMP* Offload.

```c
int dnum = 0;
```

This is the device ID for the GPU.

```c
double *a, *b, *c, alpha, beta;
MKL_INT m, n, k, lda, ldb, ldc, i;
MKL_INT sizea, sizeb, sizec;

alpha = 1.0;
beta = 1.0;

m = 1000;
n = 1000;
k = 1000;

lda = m;
ldb = k;
ldc = m;

sizea = lda * k;
sizeb = ldb * n;
sizec = ldc * n;
```

Create and initialize the parameters for the oneMKL cblas_dgemm routine.

```c
a = (double *)mkl_malloc(sizea * sizeof(double), 64);
b = (double *)mkl_malloc(sizeb * sizeof(double), 64);
c = (double *)mkl_malloc(sizec * sizeof(double), 64);

if ((a == NULL) || (b == NULL) || (c == NULL)) {
    printf("Cannot allocate matrices\n");
    return 1;
}
```

Allocate memory (64-byte aligned, via mkl_malloc) for matrices a, b, and c.

```c
for (i = 0; i < sizea; i++) {
    a[i] = 1.1;
}

for (i = 0; i < sizeb; i++) {
    b[i] = 2.2;
}

for (i = 0; i < sizec; i++) {
    c[i] = 2.0;
}
```

Initialize the matrices.

```c
#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum)
{
#pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    }
}
```

Map the data to the device indicated by device(dnum). Note that "to" means the device only reads that data (matrices a and b), while "tofrom" (matrix c) means the device can both read from and write to it: the device writes the result to matrix c, which is copied back to the host when the target data region ends.

The pragma "omp target variant dispatch" tells the runtime to execute the enclosed code on the device; in this case, the oneMKL routine cblas_dgemm. If the device is not found, execution falls back to the host.

```c
mkl_free(a);
mkl_free(b);
mkl_free(c);
```


Release the memory allocated with mkl_malloc.

#### Fortran Implementation

```fortran
include "mkl_omp_offload.f90"
include "common_blas.f90"

program dgemm_example

use onemkl_blas_omp_offload_ilp64  ! GPU variants of the BLAS routines (ILP64)

character*1 :: transa = 'N', transb = 'N'
integer :: i, j, m = 5, n = 3, k = 10
integer :: lda, ldb, ldc
double precision :: alpha = 1.0, beta = 1.0
double precision, allocatable :: a(:,:), b(:,:), c(:,:)

lda = m
ldb = k
ldc = m

allocate(a(lda,k))
allocate(b(ldb,n))
allocate(c(ldc,n))

if (.not. allocated(a)) goto 998
if (.not. allocated(b)) then
   deallocate(a)
   goto 998
end if
if (.not. allocated(c)) then
   deallocate(a)
   deallocate(b)
   goto 998
end if

! initialize matrices
call dinit_matrix(transa, m, k, lda, a)
call dinit_matrix(transb, k, n, ldb, b)
call dinit_matrix('N', m, n, ldc, c)

! Calling dgemm on the GPU
!$omp target data map(a,b,c)
!$omp target variant dispatch use_device_ptr(a,b,c)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
!$omp end target variant dispatch
!$omp end target data

deallocate(a)
deallocate(b)
deallocate(c)

stop

998 print *, 'Error: cannot allocate matrices'
999 stop 1

end program
```

Explanation:

The Fortran version follows the same pattern as the C implementation. This portion of the code maps the data to the device and executes the oneMKL routine there:

```fortran
! Calling dgemm on the GPU
!$omp target data map(a,b,c)
!$omp target variant dispatch use_device_ptr(a,b,c)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
!$omp end target variant dispatch
!$omp end target data
```

#### Compilation

##### For C

The Intel® oneAPI DPC++/C++ Compiler (icx) is needed for OpenMP* offload. The following shows how to build and dynamically link to oneMKL in sequential (single-threaded) mode:

```shell
icx -fiopenmp -fopenmp-targets=spir64 -DMKL_ILP64 -I"${MKLROOT}/include" -m64 <your code> -fiopenmp -fopenmp-targets=spir64 -fsycl -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -o <executable file>
```
##### For Fortran

The Intel® Fortran Compiler (Beta) (ifx) is needed for OpenMP* offload. The following shows how to build and dynamically link to oneMKL in parallel (OpenMP*) mode:

```shell
ifx -i8 -fiopenmp -fopenmp-targets=spir64 -DMKL_ILP64 -I"${MKLROOT}/include" -m64 -fpp <your code> -fiopenmp -fopenmp-targets=spir64 -fsycl -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -o <executable file>
```

#### Conclusion

OpenMP* offload allows applications written in C and Fortran to run on accelerators other than Intel® CPUs. This is important because many legacy applications are implemented in C and Fortran, and it is usually impractical to rewrite a large application from scratch in a different language in order to run on GPUs.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex​.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates.  See backup for configuration details.  No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.
