Migrating CUDA* Library Code: Converting cuBLAS and cuRAND Operations from CUDA to SYCL*

SYCL* is an open, industry-standard, cross-platform programming framework. Migrating proprietary, vendor-locked CUDA* code to SYCL enables efficient parallel programming across heterogeneous architectures. This blog discusses two code samples that demonstrate how to migrate cuBLAS and cuRAND math routines from CUDA to the equivalent BLAS (Basic Linear Algebra Subprograms) and RNG (Random Number Generators) functions of the Intel® oneAPI Math Kernel Library (oneMKL) SYCL API. The samples showcase the output of the Intel® DPC++ Compatibility Tool, an automated tool for CUDA-to-SYCL code migration.

Through the code samples, you will learn how to use oneMKL BLAS and RNG functions that can be offloaded to a GPU or CPU for accelerated computation across multi-vendor hardware. The oneMKL BLAS functions (the SYCL equivalent of the cuBLAS library functions) execute accelerated BLAS routines for basic matrix and vector operations. The oneMKL RNG functions (the SYCL equivalent of the cuRAND library functions) efficiently generate pseudorandom, quasi-random, and non-deterministic numbers with discrete and continuous distributions. You can thus accelerate AI and HPC applications whose numerical computations currently rely on cuBLAS and cuRAND functionality.

Before we go into the code samples' details, let us briefly look at the oneMKL SYCL API and the Intel DPC++ Compatibility Tool for CUDA-to-SYCL migration. 

Intel® oneAPI Math Kernel Library (oneMKL) SYCL API: An Overview 

oneMKL is a flexible and the most-used¹ math library for high-performance numerical computation on Intel® CPUs and GPUs. As an extension of the Intel® Math Kernel Library (MKL), it adds enhanced GPU support. It supports the SYCL framework across the latest Intel® hardware, including the Intel® Data Center GPU Max Series and 5th Gen Intel® Xeon® Scalable Processors (code-named Emerald Rapids).

The oneMKL SYCL API is part of the open, industry-standard, unified oneAPI programming model for accelerated computing across multi-vendor heterogeneous architectures. It provides SYCL routines covering various mathematical domains, including BLAS, Sparse BLAS, LAPACK, RNG, summary statistics, vector math, Fourier transforms, and data fitting. A complete open-source implementation of the oneMKL SYCL API is available as the oneMKL Interfaces project on GitHub.

►For more information on oneMKL’s SYCL support, check out the developer reference guide.

Let us quickly look at the oneMKL BLAS and RNG functionalities on which the code samples discussed in this blog are based. 

oneMKL BLAS Functions 

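oneMKL provides SYCL interfaces to BLAS routines in the standard categories: Level 1 (vector-vector operations), Level 2 (matrix-vector operations), and Level 3 (matrix-matrix operations), along with BLAS-like extensions.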

Many of these oneMKL BLAS functions map to their equivalent cuBLAS API functions. 

►Check out the oneMKL BLAS SYCL API implementations on GitHub.

oneMKL RNG Functions 

oneMKL RNG includes host routines and device routines (a set of functions callable directly from DPC++ kernels). Most of them can be executed on the same device API.  

The oneMKL RNG host and device engines map to their equivalent cuRAND generator types as follows:

cuRAND generator type          oneMKL RNG engine
CURAND_RNG_PSEUDO_DEFAULT      philox4x32x10
CURAND_RNG_PSEUDO_MT19937      mt19937

►For the complete functionality mapping, check out the article: Random Number Generation with cuRAND and oneMKL

►Check out the oneMKL RNG SYCL API implementations on GitHub.

Intel® DPC++ Compatibility Tool: An Overview 

The Intel DPC++ Compatibility Tool and its open-source counterpart, SYCLomatic, are automated tools for CUDA-to-SYCL code migration. They automatically migrate the majority of CUDA code, including library function calls, to C++ with SYCL, as shown in Fig. 1:

Fig. 1: CUDA-to-SYCL Migration Workflow

The tool migrates most CUDA math library calls to equivalent oneMKL SYCL API calls. 

►To know more about the Intel DPC++ Compatibility Tool, check out the article: Easy CUDA to SYCL Migration.

About the Code Samples 

The cuBLAS migration sample comprises 52 basic programs, each based on a single oneMKL BLAS function equivalent to a cuBLAS routine. The functions covered are grouped by BLAS level, that is, by the kind of operands they work on:

  • Level 1 (14 samples): basic routines for vector-vector operations such as computing the sum of vector magnitudes, the dot product of vectors, the vector-scalar product, and the plane rotation of points.

  • Level 2 (23 samples): routines for matrix-vector operations such as computing a matrix-vector product, performing rank-1 and rank-2 updates of matrices, and solving a linear system of equations with a triangular matrix.

  • Level 3 (15 samples): routines for matrix-matrix operations such as computing a matrix-matrix product, performing rank-k and rank-2k matrix updates, and solving a triangular matrix equation.
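To make the three levels concrete, here is a minimal sketch of one representative operation per level, written as plain C++ loops. It is not taken from the samples and does not call cuBLAS or oneMKL; the names ref_dot, ref_gemv, and ref_gemm are hypothetical, row-major storage is an assumption made for readability, and the scaling factors (alpha, beta) that the real gemv and gemm routines accept are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helpers for illustration only; these are NOT
// cuBLAS or oneMKL calls. Matrices are row-major in flat vectors.

// Level 1 (vector-vector): dot product of x and y.
double ref_dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Level 2 (matrix-vector): y = A * x, with A of size m-by-n.
std::vector<double> ref_gemv(const std::vector<double>& A, std::size_t m,
                             std::size_t n, const std::vector<double>& x) {
    assert(A.size() == m * n && x.size() == n);
    std::vector<double> y(m, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            y[i] += A[i * n + j] * x[j];
        }
    }
    return y;
}

// Level 3 (matrix-matrix): C = A * B, with A m-by-k and B k-by-n.
std::vector<double> ref_gemm(const std::vector<double>& A,
                             const std::vector<double>& B, std::size_t m,
                             std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<double> C(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += A[i * k + p] * B[p * n + j];
            }
        }
    }
    return C;
}
```

The work grows from one loop (Level 1) to two (Level 2) to three (Level 3), which is why the higher levels tend to benefit most from offloading to an accelerator.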

The cuRAND to oneMKL RNG migration sample consists of 48 sub-samples, each demonstrating a oneMKL RNG function equivalent to a cuRAND routine. There are separate folders for different engines (basic random number generator classes), such as mrg32k3a, mt19937, and philox4x32x10. Each has host and device routines for random number generation with different statistical distributions such as uniform, normal, lognormal, and Poisson.

Both code samples contain two sets of source code:

  1. The 01_sycl_dpct_output folder contains the direct SYCL-migrated output of the Intel DPC++ Compatibility Tool. This code includes some unmigrated and/or incorrectly migrated parts that must be fixed manually for functional correctness.

  2. The 02_sycl_dpct_migrated folder contains the manually repaired, fully functional version of the Intel DPC++ Compatibility Tool output.

►Check out the key implementation details to learn how to build and execute the cuBLAS and cuRAND migration samples.

Sample Outputs 

The following is example output from building and running the amax.cpp source file from the cuBLAS migration sample. It illustrates the oneMKL iamax BLAS routine, which finds the index of the vector element with the greatest absolute value.

[  0%] Building CXX object 02_sycl_dpct_migrated/Level-1/CMakeFiles/amax.dir/amax.cpp.o
[100%] Linking CXX executable amax 
[100%] Built target amax 
A 
1.00 2.00 3.00 4.00  
===== 
result 
4 
===== 
[100%] Built target run_amax
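The result of 4 for the vector 1.00 2.00 3.00 4.00 matches a one-based position, which is the convention cuBLAS uses for its amax routines. As a rough illustration of what the routine computes and why the index base matters during migration, here is a plain C++ sketch. The helper ref_iamax and its base parameter are hypothetical and are not part of cuBLAS, oneMKL, or the sample; check the index convention of your oneMKL version before relying on it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference helper, NOT a cuBLAS or oneMKL call.
// Returns the position of the first element with the largest absolute
// value, offset by `base` (0 for zero-based, 1 for one-based reporting).
std::size_t ref_iamax(const std::vector<double>& x, std::size_t base) {
    assert(!x.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < x.size(); ++i) {
        // Strict comparison keeps the first occurrence on ties.
        if (std::fabs(x[i]) > std::fabs(x[best])) {
            best = i;
        }
    }
    return best + base;
}
```

Calling ref_iamax with the sample's vector and base 1 gives 4, while base 0 gives 3. An off-by-one difference of this kind is typical of the details worth checking when comparing migrated output against the original CUDA program.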

An example output of executing the mt19937_uniform.cpp source file from the cuRAND migration sample looks as follows: 

Scanning dependencies of target mt19937_uniform 
[ 50%] Building CXX object 02_sycl_dpct_migrated/mt19937/CMakeFiles/mt19937_uniform.dir/mt19937_uniform.cpp.o 
[100%] Linking CXX executable ../../bin/mt19937_uniform 
[100%] Built target mt19937_uniform 
Host 
0.966454 
0.778166 
0.440733 
0.116851 
0.007491 
0.090644 
0.910976 
0.942535 
0.939269 
0.807002 
0.582228 
0.034926 
===== 
Device 
0.966454 
0.778166 
0.440733 
0.116851 
0.007491 
0.090644 
0.910976 
0.942535 
0.939269 
0.807002 
0.582228 
0.034926 
===== 
[100%] Built target run_mt19937_uniform
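The Host and Device sequences above are identical, which is what you would expect when both paths drive the same engine from the same seed. The sketch below illustrates that engine-plus-distribution idea with the C++ standard library only. It is an analogy, not the sample's code: ref_uniform is a hypothetical helper, the seed values are arbitrary, and the numbers it produces will not match the output above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper built on the C++ standard library, NOT on oneMKL.
// An engine (here the Mersenne Twister, mt19937) produces raw bits; a
// distribution (here uniform on [0, 1)) shapes them into samples.
std::vector<double> ref_uniform(std::uint32_t seed, std::size_t count) {
    assert(count > 0);
    std::mt19937 engine(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> out(count);
    for (double& v : out) {
        v = dist(engine);
    }
    return out;
}
```

Two calls with the same seed return the same sequence, and a different seed returns a different one. Comparing sequences this way is a simple check that a migrated RNG path behaves deterministically.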

What’s Next? 

Try out the cuBLAS migration sample and the cuRAND migration sample on GitHub. Get started with oneMKL for accelerated math computations on Intel® architectures and with the Intel DPC++ Compatibility Tool for automated CUDA-to-SYCL migration and efficient parallel programming across heterogeneous hardware.

We also encourage you to explore the other AI, HPC, and rendering tools in Intel’s oneAPI-powered software portfolio.

Additional Resources 

Get the Software 

You can get oneMKL and the Intel DPC++ Compatibility Tool as part of the Intel oneAPI Base Toolkit. You can also download the stand-alone version of oneMKL.