Executing Intel® oneAPI Math Kernel Library Routines on GPU Using OpenMP* Offload

Last Updated: 12/04/2020

By Abhinav Singh, Khang T Nguyen

Introduction

The Intel® oneAPI Math Kernel Library (oneMKL) is a set of optimized math functions that speed up compute-intensive applications in science, engineering, and finance. oneMKL can be used to optimize code for current and future generations of Intel® CPUs, GPUs, and other accelerators. This article shows how to offload oneMKL computations to Intel GPUs using OpenMP* offload in C and Fortran.

Requirements

This page contains the hardware and software requirements for running oneMKL. To get oneMKL, download and install the Intel® oneAPI Base Toolkit. To use OpenMP* offload, the Intel® oneAPI HPC Toolkit is also needed.

Examples

This section shows how to use the oneMKL matrix multiplication function with OpenMP* offload in C and Fortran.

C Implementation

#include<stdio.h>
#include<omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"
int dnum = 0;
int main(){
    double *a, *b, *c, alpha, beta;
    MKL_INT m, n, k, lda, ldb, ldc, i, j;
    MKL_INT sizea, sizeb, sizec;

    alpha = 1.0;
    beta = 1.0;

    m = 1000;
    n = 1000;
    k = 1000;

    lda = m;
    ldb = k;
    ldc = m;

    sizea = lda * k;
    sizeb = ldb * n;
    sizec = ldc * n;

   
    // allocate matrices
    a = (double *)mkl_malloc((lda * k) * sizeof(double), 64);
    b = (double *)mkl_malloc(ldb * n * sizeof(double), 64);
    c = (double *)mkl_malloc(ldc * n * sizeof(double), 64);

    if ((a == NULL) || (b == NULL) || (c == NULL)) {
        printf("Cannot allocate matrices\n");
        return 1;
    }

    // initialize matrices
  
    for (MKL_INT i = 0; i < lda * k; i++) {
       a[i] = 1.1;
    }
    
    for (MKL_INT i = 0; i < ldb * n; i++) {
       b[i] = 2.2;
    }

    for (i = 0; i < sizec; i++) {
            c[i] = 2.0;
    }

#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum)
{
 #pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
 {
   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
 }
}
    
    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    
    return 0;
}

Explanation:

#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"


These include files are needed for oneMKL routines and OpenMP* Offload.

int dnum = 0;

This is the device ID for the GPU.
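Before offloading, it can be useful to confirm that a device is actually visible to the runtime. The following sketch (not part of the original example, and independent of oneMKL) uses the standard OpenMP device-management API:

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of offload devices the OpenMP runtime can see.
// 0 means every target region will fall back to the host.
int report_devices(void) {
#ifdef _OPENMP
    int ndev = omp_get_num_devices();
#else
    int ndev = 0;  // compiled without OpenMP support: host only
#endif
    printf("OpenMP sees %d offload device(s)\n", ndev);
    return ndev;
}
```

If this reports 0 devices, the target regions in the example still run correctly, just on the host.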

double *a, *b, *c, alpha, beta;
MKL_INT m, n, k, lda, ldb, ldc, i, j;
MKL_INT sizea, sizeb, sizec;
    
alpha = 1.0;
beta = 1.0;

m = 1000;
n = 1000;
k = 1000;

lda = m;
ldb = k;
ldc = m;

sizea = lda * k;
sizeb = ldb * n;
sizec = ldc * n;

Create and initialize the parameters for the oneMKL cblas_dgemm routine.
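Note that CblasColMajor means the matrices are stored column by column: element (i, j) of a matrix whose leading dimension is lda sits at flat index i + j * lda. A small helper (purely illustrative, not part of the article's code) makes that layout concrete:

```c
// Flat index of element (i, j) in a column-major matrix whose
// leading dimension (number of rows in storage) is lda.
long colmajor_index(long i, long j, long lda) {
    return i + j * lda;
}
```

For the 1000 x 1000 matrices above (lda = 1000), element (0, 2) is therefore stored at flat index 2000.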

a = (double *)mkl_malloc((lda * k) * sizeof(double), 64);
b = (double *)mkl_malloc(ldb * n * sizeof(double), 64);
c = (double *)mkl_malloc(ldc * n * sizeof(double), 64);

if ((a == NULL) || (b == NULL) || (c == NULL)) {
    printf("Cannot allocate matrices\n");
    return 1;
}

Allocate aligned memory for matrices a, b, and c with mkl_malloc.

for (MKL_INT i = 0; i < lda * k; i++) {
     a[i] = 1.1;
}
    
for (MKL_INT i = 0; i < ldb * n; i++) {
     b[i] = 2.2;
}

for (i = 0; i < sizec; i++) {
     c[i] = 2.0;
}

Initialize those matrices.

#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum)
{
  #pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
  {
   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
  }
}

Map data to the device indicated by device(dnum). The "to" clause means the device only reads that data (matrices a and b), while "tofrom" (matrix c) means the device can both read it and write to it. The device writes the result to matrix c, which is copied back to the host at the end of the target data region.
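The same to/tofrom semantics can be seen in a minimal sketch that does not involve oneMKL: x is only read on the device, y is read and updated, and only y carries a result back. If OpenMP offload is unavailable, the pragmas are ignored and the loop simply runs on the host, producing the same result.

```c
// Element-wise y[i] += x[i], offloaded when a target device exists.
// x is mapped "to" (the device only reads it); y is mapped "tofrom"
// (copied in, updated on the device, and copied back out).
void add_on_device(const double *x, double *y, int n) {
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        #pragma omp target
        for (int i = 0; i < n; i++)
            y[i] += x[i];
    }
}
```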

The pragma "omp target variant dispatch" tells the runtime to execute the code that follows it on the device; in this case, the oneMKL routine cblas_dgemm. If no suitable device is found, execution falls back to the host.

mkl_free(a);
mkl_free(b);
mkl_free(c);

Free the memory allocated for the matrices.
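Because a, b, and c are filled with constants, the dgemm result can be verified analytically before the matrices are freed: every element of c should equal beta * 2.0 plus a dot product of k identical terms, i.e. 2.0 + k * 1.1 * 2.2 = 2422.0 for k = 1000. A hypothetical check (the function name and tolerance are this sketch's own) could be called just before the mkl_free calls:

```c
#include <math.h>

// Returns 1 if every element of c matches the closed-form value
// 2.0 + k * 1.1 * 2.2 (within a relative tolerance), else 0.
int check_result(const double *c, long sizec, long k) {
    double expected = 2.0 + (double)k * 1.1 * 2.2;
    for (long i = 0; i < sizec; i++)
        if (fabs(c[i] - expected) > 1e-6 * expected)
            return 0;
    return 1;
}
```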

Fortran Implementation

include "mkl_omp_offload.f90"
include "common_blas.f90"

program dgemm_example

use onemkl_blas_omp_offload_ilp64

character*1 :: transa = 'N', transb = 'N'
integer :: i, j, m = 5, n = 3, k = 10
integer :: lda, ldb, ldc
double precision :: alpha = 1.0, beta = 1.0
double precision,allocatable :: a(:,:), b(:,:), c(:,:)

lda = m
ldb = k
ldc = m

allocate(a(lda,k))
allocate(b(ldb,n))
allocate(c(ldc,n))

if (.not. allocated(a)) goto 998
if (.not. allocated(b)) then
   deallocate(a)
   goto 998
end if
if (.not. allocated(c)) then
   deallocate(a)
   deallocate(b)
   goto 998
end if

! initialize matrices

do i = 1, m
   do j = 1, k
      a(i, j) = 1.1
   end do
end do

do i = 1, k
   do j = 1, n
      b(i, j) = 2.2
   end do
end do

do i = 1, m
   do j = 1, n
      c(i, j) = 2.0
   end do
end do


! Calling dgemm on the GPU
!$omp target data map(a,b,c)
!$omp target variant dispatch use_device_ptr(a,b,c)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
!$omp end target variant dispatch
!$omp end target data


deallocate(a)
deallocate(b)
deallocate(c)

stop

998 print *, 'Error: cannot allocate matrices' 
999 stop 1
end program

Explanation:
The Fortran implementation is very similar to the C implementation.
This portion of the code maps the data to the device and executes the oneMKL routine on the device:

! Calling dgemm on the GPU
!$omp target data map(a,b,c)
!$omp target variant dispatch use_device_ptr(a,b,c)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
!$omp end target variant dispatch
!$omp end target data

Building and Linking

For C

The Intel® oneAPI DPC++/C++ Compiler (icx) is needed for OpenMP* offload. The following shows how to build and dynamically link to oneMKL in sequential mode:

icx -fiopenmp -fopenmp-targets=spir64  -DMKL_ILP64  -I"${MKLROOT}/include" -m64 <your code> -fiopenmp -fopenmp-targets=spir64  -fsycl  -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -o <Executable file>
For Fortran

The Intel® Fortran Compiler (Beta) (ifx) is needed for OpenMP* offload. The following shows how to build and dynamically link to oneMKL in parallel (OpenMP*) mode:

ifx -i8 -fiopenmp -fopenmp-targets=spir64  -DMKL_ILP64  -I"${MKLROOT}/include" -m64 -fpp <your code> -fiopenmp -fopenmp-targets=spir64  -fsycl  -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -o <executable file>

You can also use the Link Line Advisor to build and link your code.

Conclusion

OpenMP* offload allows applications written in C and Fortran to run on accelerators beyond Intel® CPUs. This is important because many legacy applications are implemented in C and Fortran, and it is usually not practical to rewrite a large application from scratch in a different language just to run it on GPUs.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex​.  

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates.  See backup for configuration details.  No product or component can be absolutely secure. 

Your costs and results may vary. 

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others. 
