# Executing Intel® oneAPI Math Kernel Library Routines on GPU Using OpenMP* Offload

Last Updated: 12/04/2020

### Introduction

The Intel® oneAPI Math Kernel Library (oneMKL) is a set of optimized math functions for speeding up compute-intensive applications in science, engineering, and finance. oneMKL can be used to optimize code for current and future generations of Intel® CPUs, GPUs, and other accelerators. This article shows how to offload oneMKL computations to Intel GPUs using OpenMP* offload in C and Fortran.

### Requirements

The oneMKL documentation lists the hardware and software requirements for running oneMKL. To get oneMKL, download and install the Intel® oneAPI Base Toolkit. To use OpenMP* offload, the Intel® oneAPI HPC Toolkit is also needed.

### Examples

This section shows how to use the oneMKL matrix multiply routine with OpenMP* offload from C and Fortran.

#### C Implementation

```c
#include <stdio.h>
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"

int dnum = 0;

int main() {
    double *a, *b, *c, alpha, beta;
    MKL_INT m, n, k, lda, ldb, ldc, i;
    MKL_INT sizea, sizeb, sizec;

    alpha = 1.0;
    beta = 1.0;

    m = 1000;
    n = 1000;
    k = 1000;

    lda = m;
    ldb = k;
    ldc = m;

    sizea = lda * k;
    sizeb = ldb * n;
    sizec = ldc * n;

    // allocate matrices
    a = (double *)mkl_malloc(sizea * sizeof(double), 64);
    b = (double *)mkl_malloc(sizeb * sizeof(double), 64);
    c = (double *)mkl_malloc(sizec * sizeof(double), 64);

    if ((a == NULL) || (b == NULL) || (c == NULL)) {
        printf("Cannot allocate matrices\n");
        return 1;
    }

    // initialize matrices
    for (i = 0; i < sizea; i++) {
        a[i] = 1.1;
    }

    for (i = 0; i < sizeb; i++) {
        b[i] = 2.2;
    }

    for (i = 0; i < sizec; i++) {
        c[i] = 2.0;
    }

    // map the matrices to the device and run dgemm there
#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum)
    {
#pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
        {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
        }
    }

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);

    return 0;
}
```


Explanation:

```c
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"
```

These include files are needed for oneMKL routines and OpenMP* Offload.

```c
int dnum = 0;
```

This is the device ID for the GPU.

```c
double *a, *b, *c, alpha, beta;
MKL_INT m, n, k, lda, ldb, ldc, i;
MKL_INT sizea, sizeb, sizec;

alpha = 1.0;
beta = 1.0;

m = 1000;
n = 1000;
k = 1000;

lda = m;
ldb = k;
ldc = m;

sizea = lda * k;
sizeb = ldb * n;
sizec = ldc * n;
```

Create and initialize the parameters for the oneMKL cblas_dgemm routine.

```c
a = (double *)mkl_malloc(sizea * sizeof(double), 64);
b = (double *)mkl_malloc(sizeb * sizeof(double), 64);
c = (double *)mkl_malloc(sizec * sizeof(double), 64);

if ((a == NULL) || (b == NULL) || (c == NULL)) {
    printf("Cannot allocate matrices\n");
    return 1;
}
```

Allocate memory (64-byte aligned, via mkl_malloc) for matrices a, b, and c.

```c
for (i = 0; i < sizea; i++) {
    a[i] = 1.1;
}

for (i = 0; i < sizeb; i++) {
    b[i] = 2.2;
}

for (i = 0; i < sizec; i++) {
    c[i] = 2.0;
}
```

Initialize the matrices.

```c
#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum)
{
#pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    }
}
```

Map the data to the device indicated by device(dnum). Note that "to" means the device only reads that data (matrices a and b), while "tofrom" (matrix c) means the device can both read from and write to it: the device writes the result to matrix c, which is copied back to the host when the target data region ends.

The pragma "omp target variant dispatch" tells the runtime to execute the enclosed code on the device; in this case, the oneMKL routine cblas_dgemm. If the device is not found, execution falls back to the host.

```c
mkl_free(a);
mkl_free(b);
mkl_free(c);
```


Release the memory allocated with mkl_malloc.

#### Fortran Implementation

```fortran
include "mkl_omp_offload.f90"
include "common_blas.f90"

program dgemm_example

use onemkl_blas_omp_offload_ilp64  ! GPU variants of the BLAS routines (ILP64)

character*1 :: transa = 'N', transb = 'N'
integer :: i, j, m = 5, n = 3, k = 10
integer :: lda, ldb, ldc
double precision :: alpha = 1.0, beta = 1.0
double precision, allocatable :: a(:,:), b(:,:), c(:,:)

lda = m
ldb = k
ldc = m

allocate(a(lda,k))
allocate(b(ldb,n))
allocate(c(ldc,n))

if (.not. allocated(a)) goto 998
if (.not. allocated(b)) then
   deallocate(a)
   goto 998
end if
if (.not. allocated(c)) then
   deallocate(a)
   deallocate(b)
   goto 998
end if

! initialize matrices
call dinit_matrix(transa, m, k, lda, a)
call dinit_matrix(transb, k, n, ldb, b)
call dinit_matrix('N', m, n, ldc, c)

! Calling dgemm on the GPU
!$omp target data map(a,b,c)
!$omp target variant dispatch use_device_ptr(a,b,c)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
!$omp end target variant dispatch
!$omp end target data

deallocate(a)
deallocate(b)
deallocate(c)

stop

998 print *, 'Error: cannot allocate matrices'
999 stop 1

end program
```

Explanation:

The Fortran version follows the same pattern as the C implementation. This portion of the code maps the data to the device and executes the oneMKL routine there:

```fortran
! Calling dgemm on the GPU
!$omp target data map(a,b,c)
!$omp target variant dispatch use_device_ptr(a,b,c)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
!$omp end target variant dispatch
!$omp end target data
```

#### Compilation

##### For C

The Intel® oneAPI DPC++/C++ Compiler (icx) is needed for OpenMP* offload. The following shows how to build and dynamically link to oneMKL in sequential (single-threaded) mode:

```shell
icx -fiopenmp -fopenmp-targets=spir64 -DMKL_ILP64 -I"${MKLROOT}/include" -m64 <your code> -fiopenmp -fopenmp-targets=spir64 -fsycl -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -o <executable file>
```
##### For Fortran

The Intel® Fortran Compiler (Beta) (ifx) is needed for OpenMP* offload. The following shows how to build and dynamically link to oneMKL in parallel (OpenMP*) mode:

```shell
ifx -i8 -fiopenmp -fopenmp-targets=spir64 -DMKL_ILP64 -I"${MKLROOT}/include" -m64 -fpp <your code> -fiopenmp -fopenmp-targets=spir64 -fsycl -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -o <executable file>
```

#### Conclusion

OpenMP* offload allows applications written in C and Fortran to run on accelerators other than Intel® CPUs. This is important because many legacy applications are implemented in C and Fortran, and it is usually impractical to rewrite a large application from scratch in a different language in order to run on GPUs.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex​.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates.  See backup for configuration details.  No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.
