Intel® oneAPI Math Kernel Library (oneMKL) Release Notes

Published: 07/03/2019  

Last Updated: 09/15/2021

By Khang T Nguyen, Abhinav Singh

 

Where to Find the Release

Intel® oneAPI Math Kernel Library

New in This Release

NOTE for NuGet Package Manager Users: The oneMKL NuGet packages for the 2021.4 release will be delayed. We are working to reduce the package size to fit within NuGet size limits. Because of this, oneMKL packages for 2021.4 will not be available at the oneAPI 2021.4 release. We hope to upload them soon; please check back for information on these packages. If you do not use the NuGet package manager, you are not affected.

2021.4

System Requirements   Bug Fix Log

Features

  • BLAS

    • Introduced half precision (float16) general matrix-matrix multiply (CBLAS_HGEMM()) for C with MKL_ENABLE_INSTRUCTIONS=AVX512_E4. It falls back to CBLAS_SGEMM() when the float16 type is not supported natively on the processor.
    • Extended DPC++ GEMM_BATCH API to support alpha and beta scaling vectors. 
    • Improved Intel Advanced Matrix Extensions (GEMM_S8U8S32() and GEMM_BF16BF16F32()) performance. 
    • Enabled subdevice support for the OpenMP offload variant dispatch. 
  • Transpose

    • Introduced C/Fortran MKL_{I, O}MATCOPY_BATCH APIs
  • LAPACK

    • Introduced GETRI_OOP_BATCH and GETRI_OOP_BATCH_STRIDED (group and strided batch out-of-place LU inverse) functions, with C OpenMP* offload support for GETRI_OOP_BATCH_STRIDED.
    • Extended support to lp64 interfaces of batched LAPACK functions for OpenMP* offload 
    • Improved performance of ?GETRF for tall-and-skinny matrices on CPU 
    • Improved performance of getrs_batch_strided for OpenMP* offload
  • Sparse Solvers

    • Enabled mkl_progress support for TBB threading
    • Inspector Executor sparse BLAS BSR MM performance improved for block sizes 4 and 8
  • Sparse BLAS

    • Added new DPC++ API for sparse-sparse matrix multiply (sparse::matmat) 
    • Enabled column-major layout in DPC++ sparse::gemm
    • Enabled C/C++ OpenMP* offload for Sparse BLAS Inspector-Executor (IE) trsv
  • Vector Math

    • Improved threading control for the Classic interface: the TBB threading mode can be selected by setting the TBB control flag with vmlSetMode to optimize application performance
    • Enabled support for single and double precision complex EXP, LOG and SQRT on GPU.
  • Vector Statistics 

    • Enabled C/C++/Fortran OpenMP* offload APIs for the following oneMKL Summary Statistics routines: min/max, raw and central sums/moments, variation coefficient, skewness, and kurtosis
    • Enabled oneMKL Random Number Generators (RNG) Multinomial, PoissonV, Hypergeometric, Negative Binomial and Binomial distributions on GPU (with DPC++ and OpenMP* offload APIs)  
    • Fixed the issue with Device RNG routines; the “-fno-sycl-early-optimizations” compilation flag is no longer required
  • Library Engineering

    • Enabled 16-bit Floating point instruction detection
    • Improved integration with Intel Inspector 
    • Added debug configuration support to examples

Known Issues and Limitations

  • Using ordered queues on a CPU device may require explicit synchronization with queue.wait() or queue.wait_and_throw() before calls to the oneMKL DPC++ API.
  • LAPACK functions {sy,he}{ev,evd,gvd,gvx,trd} for single precision may work incorrectly on Intel® Iris® Xe MAX Graphics / Intel® UHD Graphics for Intel® Processor Graphics Gen11.
  • DGETRFNP_BATCH_STRIDED for size 64x64 for Fortran OpenMP* offload may crash with static linking. As a workaround, please use dynamic linking or C OpenMP* offload.
  • DPC++ Sparse BLAS Matmat (sparse * sparse -> sparse) functionality currently supports only GPU and CPU devices, not Host. The function may sporadically give incorrect results with the Level Zero backend on a CPU device.
  • GEMV_BATCH_USM DPC++ API might fail intermittently with L0 backend on DG1. 
  • BLAS APIs can crash sporadically when called from multiple host threads with L0 backend. 
  • DPC++ BLAS group batch APIs might hang or crash on DG1 if large batch sizes are given.

Deprecations

  • Support for Intel® Xeon Phi™ Processor x200 “Knights Landing (KNL)” and Intel® Xeon Phi™ Processors “Knights Mill (KNM)” is deprecated and will be removed in a future release.  Intel® Xeon Phi™ customers should continue to use compilers, libraries, and tools from the Intel® Parallel Studio XE 2020 and older PSXE releases, or compilers from the Intel® oneAPI Base Toolkit and Intel® oneAPI HPC Toolkit versions 2021.2 or 2021.1.

  • Vector Math

    • The extraneous non-const USM API was removed from the interface header files. No source code changes are needed.

2021.3

System Requirements   Bug Fix Log

Features

  • BLAS

    • Introduced DPC++ SYRK_BATCH for USM and buffer APIs, both group and strided, with CPU and GPU support
    • Introduced C/Fortran SYRK_BATCH group and strided APIs
    • Introduced Data Parallel C++ (DPC++) COPY_BATCH for USM and buffer APIs, both group and strided, with CPU and GPU support
    • Improved DPC++ TRSM_BATCH performance for strided and group APIs for Intel® UHD Graphics for Intel® Processor Graphics and Intel® Xe 
    • Improved C/Fortran TRSM_BATCH performance for strided and group APIs for Intel® UHD Graphics for Intel® Processor Graphics and Intel® Xe
    • Fixed GEMM_BATCH deadlock issue with MKL_CBWR=AVX
  • Transpose

    • Introduced C/Fortran MKL_{I, O}MATCOPY_BATCH_STRIDED APIs
    • Enabled C/C++/Fortran OpenMP offload for MKL_{I, O}MATCOPY_BATCH_STRIDED APIs
  • LAPACK

    • Introduced getrf_batch (group batched LU factorization with partial pivoting) functions, with C OpenMP* offload support.
    • Improved performance of POTRF on Intel® UHD Graphics for Intel® Processor Graphics Gen9. 
    • Improved performance of TRTRI for small sizes on CPU. 
  • Sparse Solvers

    • Improved time reporting for oneMKL PARDISO reordering step
  • Sparse BLAS

    • Enabled support of oneapi::mkl::diag::unit in DPC++ sparse::trsv 
    • Enabled support of device memory allocation for DPC++ Sparse BLAS functionality
    • Improved performance of DPC++ sparse::gemm on Intel® UHD Graphics for Intel® Processor Graphics Gen9 and Intel® Xe
  • Vector Math

    • Several Vector Math functions were optimized, and their performance improved on the GPU: POWX/POWR; LOG2/LOG10; EXP2; medium and low accuracy CSQRT, CEXP, and CLN
  • Vector Statistics 

    • Introduced DPC++ APIs for save/load random number generators (RNG) functionality 
    • Enabled Beta, Gamma and Chi-square RNG distributions on GPU (with DPC++ and OpenMP* offload APIs)  
    • Optimized performance of Gaussian RNG distribution on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architectures 
  • Library Engineering

    • Introduced CMake config file support 
    • Enabled DPC++ dynamic libraries support for all DPC++ enabled functionality on Windows* 
    • Added debug versions of the mkl_sycl and mkl_tbb_thread libraries on Windows*
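
The MKL_{I, O}MATCOPY_BATCH_STRIDED entries under Transpose above concern copy/transpose operations applied to a batch of matrices stored at a fixed stride. As a rough illustration of the out-of-place case, the plain C loop below assumes the conventional meaning of such an operation (scale by alpha, optionally transpose, column-major storage). It is a sketch, not the oneMKL implementation or signature; `ref_omatcopy_batch_strided` and all of its parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference loop, NOT the oneMKL signature: for each of
 * `batch` column-major matrices, writes B_p = alpha * A_p (transpose == 0)
 * or B_p = alpha * transpose(A_p) (transpose != 0).
 * A_p is rows x cols and starts at a + p * stride_a;
 * B_p starts at b + p * stride_b. */
void ref_omatcopy_batch_strided(int transpose, size_t rows, size_t cols,
                                double alpha,
                                const double *a, size_t lda, size_t stride_a,
                                double *b, size_t ldb, size_t stride_b,
                                size_t batch)
{
    /* Leading dimensions must cover one full column of each matrix. */
    assert(lda >= rows);
    assert(ldb >= (transpose ? cols : rows));

    for (size_t p = 0; p < batch; ++p) {
        const double *ap = a + p * stride_a;
        double *bp = b + p * stride_b;
        for (size_t j = 0; j < cols; ++j) {
            for (size_t i = 0; i < rows; ++i) {
                double v = alpha * ap[j * lda + i];
                if (transpose)
                    bp[i * ldb + j] = v;  /* B_p(j, i) = alpha * A_p(i, j) */
                else
                    bp[j * ldb + i] = v;  /* B_p(i, j) = alpha * A_p(i, j) */
            }
        }
    }
}
```

For example, two 2x3 matrices stored back to back (lda = 2, stride_a = 6) transpose into two 3x2 matrices with ldb = 3 and stride_b = 6.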

Known Issues and Limitations

  • Intel® oneAPI Math Kernel Library (oneMKL) Vector Math (VM) and Vector Statistics examples for OpenMP* offload for Fortran on Windows* may not work with the 2021.3 compiler if static linking is used. This is a known issue, which will be resolved in the next compiler/library update. As a workaround, dynamic (mixed) linking may be used.
  • If static linking is used for the oneMKL Vector Math and Vector Statistics examples for OpenMP* offload for Fortran on Linux*, they may not work with the XeLP card. As a workaround, dynamic linking may be used.
  • Device Random Number Generators (RNG) routines should be used with the “-fno-sycl-early-optimizations” compilation flag.
  • LAPACK functions {sy,he}{ev,evd,gvd,gvx,trd} for single precision may work incorrectly on Intel® Iris® Xe  MAX Graphics / Intel® UHD Graphics for Intel® Processor Graphics Gen11.
  • FFT DPC++ dynamic linking and debug support on Windows* are broken and will produce an error.
  • FFT C and Fortran offload functionality as well as DPC++ may fail on Intel® Xe with the OpenCL runtime and/or LIBOMPTARGET_OPENCL_USE_SVM=1. Instead, use the Level Zero runtime or the OpenCL runtime with LIBOMPTARGET_OPENCL_USE_SVM=0 (default).
  • oneMKL CDFT examples can fail with the MS HPC package because the cdft_support target depends on the mpi.mod file, which MS HPC does not provide by default.
  • A hang was observed with the AXPY_BATCH_USM and COPY_BATCH_USM (Group) APIs for both real and complex single precision on DG1 using the OpenCL driver on Windows when the product of group_count and group_size is greater than 503. This hang was not observed when using the Level Zero driver.
  • DPC++ applications that work with oneMKL via third-party plugins loaded at runtime could fail at the end of execution on Linux when using the DPC++ runtime with the Level Zero backend, because of the incorrect unloading order of oneMKL, the DPC++ runtime, and Level Zero.

Deprecations

  • Support for Intel® Xeon Phi™ Processor x200 “Knights Landing (KNL)” and Intel® Xeon Phi™ Processors “Knights Mill (KNM)” is deprecated and will be removed in a future release.  Intel® Xeon Phi™ customers should continue to use compilers, libraries, and tools from the Intel® Parallel Studio XE 2020 and older PSXE releases, or compilers from the Intel® oneAPI Base Toolkit and Intel® oneAPI HPC Toolkit versions 2021.2 or 2021.1.

2021.2 

System Requirements   Bug Fix Log

Features

  • BLAS

    • Enabled 0 stride support on input matrices for BLAS batch strided API on CPU.
    • Enabled OpenMP* offload support for DGMM_BATCH, GEMV_BATCH and AXPBY functions. 
    • Optimized DGEMM and DGEMM_BATCH for certain sizes on Intel® UHD Graphics for Intel® Processor Graphics Gen9. 
    • Improved TRSM optimizations. 
    • Increased systolic GEMM performance by caching internal buffers.
    • Added support for USM pointers in OpenCL* for OpenMP* offload (export LIBOMPTARGET_OPENCL_USE_SVM=0).
    • Improved BLAS OpenMP* offload performance with the Level Zero runtime.
    • Fixed some memory leaks in BLAS OpenMP* offload.
  • LAPACK

    • Introduced {getrf, getrs, getrfnp, getrsnp}_batch_strided (batched LU factorization and solve, with and without partial pivoting) functions, with OpenMP* offload support.
    • Enabled OpenMP* offload support for POTRF, POTRS, POTRI. 
    • Extended support to lp64 interfaces of non-batched LAPACK functions for OpenMP* offload. 
    • Improved performance of {D,S}SYEVX and {Z,C}HEEVX for large sizes on CPU. 
    • Improved performance of DGEEV for small sizes on CPU. 
    • Improved performance of DGETRF and DPOTRF on Intel® UHD Graphics for Intel® Processor Graphics Gen9. 
  • FFT

    • Enabled OpenMP* offload support for C and Fortran FFTW interfaces. 
    • Enabled the Level Zero backend for C/Fortran OpenMP* offload functionality.
  • Sparse Solvers

    • Broadened support for extended precision iterative refinement in oneMKL PARDISO.
    • Improved performance of the oneMKL PARDISO solving phase with multiple right-hand sides.
    • Increased optional verbosity of the sparse extremal eigensolvers for Krylov-Schur algorithm by reporting the iterative stop reason. 
  • Sparse BLAS

    • Added C OpenMP* offload support for Sparse MM. 
    • Improved performance of lu_smoother functionality for CSR format. 
  • Vector Math

    • Introduced Strided API support for Data Parallel C++ (DPC++) and C/Fortran OpenMP* Offload.
    • Added support for USM pointers in OpenCL* for OpenMP* offload. 
    • Improved performance of the single precision tgamma and lgamma functions on the GPU by more than 3x. 
  • Vector Statistics 

    • Introduced Poisson and Exponential distributions with DPC++ device APIs. 
    • Enabled multivariate Gaussian distribution on GPU (with DPC++ and OpenMP* offload APIs).  
    • Enabled OpenMP* offload support for MT19937, MT2203 and Sobol engines. 
    • Added service distribution methods to set / get distribution’s parameters for DPC++ APIs. 
    • Added accurate method for integer (signed/unsigned) uniform distribution for DPC++ APIs. 
    • Added support for USM pointers in OpenCL* for OpenMP* offload. 
    • Added support of unsigned integer type in uniform distribution with DPC++ APIs.  
    • Improved OpenMP* offload performance.  
    • Fixed an incorrect behavior of SkipAhead and Leapfrog routines combination for MCG31M1 and MCG59 engines.  
    • Fixed Gaussian/Lognormal box_muller method seriality issues on GPU for MT19937/MRG32K3A engines. 
  • Data Fitting

    • Optimized memory consumption of data fitting tasks. 
  • Library Engineering

    • Extended Link Line Tool support for OpenMP* Fortran offload. 
    • Enabled support for PGI 19.1 and 20.1. 
    • The structure of the examples was refactored to better highlight the features of the different languages, and all examples were updated to use CMake. The examples no longer support ia32, but ia32 is still supported by the oneMKL product.
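
The 0 stride support for the BLAS batch strided API listed under BLAS above can be pictured with a plain C reference loop: when the stride of an input matrix is 0, every batch entry reads the same matrix. The sketch below is not the oneMKL implementation or signature; `ref_gemm_batch_strided` and its parameters are hypothetical names chosen for illustration, and it assumes column-major storage without transposition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference loop, NOT the oneMKL API: computes
 * C_p = alpha * A_p * B_p + beta * C_p for p = 0 .. batch-1,
 * column-major, no transposition.
 * A_p starts at a + p * stride_a, and likewise for B_p and C_p. */
void ref_gemm_batch_strided(size_t m, size_t n, size_t k,
                            double alpha,
                            const double *a, size_t lda, size_t stride_a,
                            const double *b, size_t ldb, size_t stride_b,
                            double beta,
                            double *c, size_t ldc, size_t stride_c,
                            size_t batch)
{
    /* Output matrices must not overlap, so only input strides may be 0. */
    assert(batch <= 1 || stride_c >= ldc * n);

    for (size_t p = 0; p < batch; ++p) {
        const double *ap = a + p * stride_a;  /* stride_a == 0: same A each time */
        const double *bp = b + p * stride_b;  /* stride_b == 0: same B each time */
        double *cp = c + p * stride_c;
        for (size_t j = 0; j < n; ++j) {
            for (size_t i = 0; i < m; ++i) {
                double acc = 0.0;
                for (size_t l = 0; l < k; ++l)
                    acc += ap[l * lda + i] * bp[j * ldb + l];
                cp[j * ldc + i] = alpha * acc + beta * cp[j * ldc + i];
            }
        }
    }
}
```

Passing `stride_a = 0` with a batch of different B matrices multiplies one shared A by each B_p without replicating A in memory.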

Known Issues and Limitations

  • Dynamic linking on Windows* is supported only for the BLAS and LAPACK domains.
  • Intel® oneAPI oneMKL Vector Math and Random Number Generator examples for OpenMP* offload for Fortran may not work with the 2021.2 compiler in Windows*. This is a known issue, which will be resolved in the next compiler/library update. oneMKL Vector Math examples for OpenMP* offload for Fortran on Linux* are not affected.
  • If static linking is used for oneMKL Vector Math in Linux, OpenMP* offload for Fortran examples may not work with the XeLP card. As a workaround, dynamic linking may be used.
  • DPC++ and OpenMP* offload examples for oneMKL Vector Math may not work with one specific version of the compute runtime (21.05.18936). To resolve this issue, a more recent driver (21.06.18993, or later) may be used.
  • DPC++ applications that work with oneMKL via third-party plugins loaded at runtime could fail at the end of execution on Linux when using the DPC++ runtime with the Level0 backend, because of the incorrect unloading order of oneMKL, the DPC++ runtime, and Level0.
  • oneMKL VM examples for OpenMP* offload for Fortran may not work with the 2021.2 compiler in Windows*. This is a known issue, which will be resolved in the next compiler/library update. oneMKL VM examples for OpenMP* offload for Fortran on Linux* are not affected. 
  • Device Random Number Generators routines should be used with the “-fno-sycl-early-optimizations” compilation flag. 
  • oneapi::mkl::rng::device::engine_descriptor with philox4x32x10 engine may work incorrectly on Intel® Iris® Xe MAX Graphics GPUs with Level Zero backend. As a workaround, use OpenCL* backend. 
  • The oneMKL standalone package and the PyPI and Conda oneMKL devel packages are missing the Intel® oneAPI Threading Building Blocks (oneTBB) import library on Windows*. Linking statically with the oneMKL oneTBB threading library requires downloading the oneTBB standalone package in addition to oneMKL and using the oneTBB import library from it. You can download oneTBB for Windows* as a single component or download the tbb-devel package via PyPI or Conda.
  • LAPACK DPC++ sygvd and hegvd for double precision via the USM APIs may produce inaccurate results on Intel® UHD Graphics for Intel® Processor Graphics Gen9 with the OpenCL* backend. As a workaround, use the DPC++ buffer APIs or the Level0 backend.
  • LAPACK functions {sy,he}gv{d,x} for single precision may work incorrectly on Intel® Iris® Xe MAX Graphics/Intel® UHD Graphics for Intel® Processor Graphics Gen11. 
  • LAPACK DPC++ routine potri for lower triangular matrices may produce incorrect results on a GPU device.
  • The oneMKL devel package (mkl-devel) for PIP distribution on Linux* and macOS* does not provide symlinks for the dynamic libraries (for more information, see PIP GitHub issue #5919). For dynamic or single dynamic library linking with the oneMKL devel package (for more information, see the oneMKL Link Line Advisor), you must modify the link line to use the full names and versions of the oneMKL libraries. Refer to Step 1 in the get started guide.
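
For the compute runtime issue above (21.05.18936 affected; 21.06.18993 or later recommended), an application or test script may want to check a version string before running the Vector Math offload examples. The sketch below assumes the version string has three dot-separated numeric fields that are compared field by field; that ordering and the helper name `compute_runtime_is_fixed` are assumptions, and how the version string is obtained from the driver is not covered here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if `version` (assumed to look like
 * "major.minor.build") is 21.06.18993 or later, 0 if it is older,
 * and -1 if it cannot be parsed. */
int compute_runtime_is_fixed(const char *version)
{
    unsigned major, minor, build;

    assert(version != NULL);
    if (sscanf(version, "%u.%u.%u", &major, &minor, &build) != 3)
        return -1;

    /* Compare field by field against 21.06.18993. */
    if (major != 21u)
        return major > 21u;
    if (minor != 6u)
        return minor > 6u;
    return build >= 18993u;
}
```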

2021.1 Initial Release

System Requirements   Bug Fix Log

Features

With this release, the product previously known as the Intel® Math Kernel Library (Intel® MKL) becomes the Intel® oneAPI Math Kernel Library (oneMKL).

Existing Intel® MKL product users can migrate to oneMKL with confidence, knowing that Intel continues to support the same C and Fortran APIs for CPUs as it has for years.

oneMKL extends beyond the traditional C and Fortran APIs with support for two new programming models for programming Intel GPUs: Data Parallel C++ (DPC++) APIs, which support both the CPU and Intel GPUs, and C/Fortran OpenMP* Offload interfaces, which target Intel GPUs.

We have changed the versioning model for shared libraries; please refer to the developer guide for more details. 

The following table illustrates the general domain areas included in oneMKL and which areas have been provided under the new DPC++ and OpenMP* Offload programming models:

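| Domain | CPU APIs: DPC++ | CPU APIs: C | CPU APIs: Fortran | Intel GPU APIs: DPC++ | Intel GPU APIs: C OpenMP* Offload | Intel GPU APIs: Fortran OpenMP* Offload |
|---|---|---|---|---|---|---|
| BLAS and BLAS-like Extensions | Yes | Yes | Yes | Yes | Yes | Yes |
| LAPACK and LAPACK-like Extensions | Yes (1) | Yes | Yes | Yes (1) | Yes (2) | Yes (2) |
| ScaLAPACK | No | Yes | Yes | No | No | No |
| Vector Math | Yes | Yes | Yes | Yes (5) | Yes (5) | Yes (3, 5) |
| Vector Statistics (Random Number Generators) | Yes (1) | Yes | Yes | Yes (1) | Yes (2) | Yes (2, 3) |
| Vector Statistics (Summary Statistics) | Yes (1) | Yes | Yes | Yes (1) | No | No |
| Data Fitting | No | Yes | Yes | No | No | No |
| FFT/DFT | Yes | Yes | Yes | Yes | Yes (4) | Yes (4) |
| Sparse BLAS | Yes (1) | Yes | Yes | Yes (1) | Yes (2) | No |
| Sparse Solvers | No | Yes | Yes | No | No | No |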

1: Subset of the full functionality available. Refer to the DPC++ developer reference for the full list of DPC++ functionality supported.

2: Subset of the full functionality available. For the list of functionality, refer to the developer reference (C and Fortran).

3: Supported on Linux* only.

4: DFTI interfaces are supported; FFTW interfaces are not supported.

5: Subset of the full functionality available. Refer to the DPC++ developer reference for the full list of DPC++ functionality supported, or to the developer reference for C and Fortran. Functions that are not implemented for GPU can still be used and will be executed transparently on the host CPU.

Performance Recommendation:

  • For DPC++ and OpenMP* offload on Windows*, use the OpenCL* runtime for the best performance of BLAS and LAPACK functionality. To enable the OpenCL* runtime, set the following environment variables:
    • SYCL_BE=PI_OPENCL
    • LIBOMPTARGET_PLUGIN=opencl
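
The two variables above are normally set in the shell before the application starts. As an alternative, a small launcher can set them and then start the application as a child process, which inherits the environment. The sketch below assumes this launcher pattern; `set_env_var` and `run_with_opencl_runtime` are hypothetical helper names, and only the variable names and values come from this document. Whether the runtimes honor variables changed inside an already running process is not stated here, which is why the sketch launches a separate process.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L  /* expose setenv() under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>

/* Portable wrapper: _putenv_s on Windows, setenv elsewhere.
 * Returns 0 on success. */
int set_env_var(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
#ifdef _WIN32
    return _putenv_s(name, value);
#else
    return setenv(name, value, 1);  /* 1 = overwrite an existing value */
#endif
}

/* Hypothetical launcher helper: selects the OpenCL* runtime for DPC++
 * (SYCL_BE) and for OpenMP* offload (LIBOMPTARGET_PLUGIN), then runs
 * `command` as a child process that inherits these settings.
 * Returns -1 if a variable could not be set, else the result of system(). */
int run_with_opencl_runtime(const char *command)
{
    if (set_env_var("SYCL_BE", "PI_OPENCL") != 0 ||
        set_env_var("LIBOMPTARGET_PLUGIN", "opencl") != 0)
        return -1;
    return system(command);
}
```

A launcher's main function would pass the application's command line to `run_with_opencl_runtime` and return its result.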

DPC++ Known Issues and Limitations

  • Dynamic linking on Windows* is supported only for the BLAS and LAPACK domains.
  • The custom dynamic builder tool does not support building custom dynamic libraries for the DPC++ interfaces. 
  • Device RNG routines should be used with the “-fno-sycl-early-optimizations” compilation flag on a CPU device.  
  • Discrete Fourier Transform (DFT) on Intel GPU and Intel® oneAPI Level Zero backend may result in incorrect results for large batch sizes. To run DFT on Intel GPU, set the environment variable SYCL_BE=PI_OPENCL. 
  • Real backward out-of-place DFT can produce incorrect results on Intel GPU. As a workaround, use the in-place transform. 
  • LU factorization (getrf) on Intel GPU may fail with an invalid argument error when used with an OpenCL* backend and in double precision. As a workaround, use the oneAPI Level Zero backend. 
  • Static linking on Windows* can take significant time (up to 10 minutes). Linking static libraries can lead to a large application size due to GPU kernels.  
  • The USM APIs of Sparse BLAS work only with input arrays allocated by malloc_shared, so that the data is always accessible from the host.
  • On Windows*, the DPC++ library has only a Release version, and it can’t be used to build a Debug version of DPC++ applications.
  • For DFT on GPU, user-defined strides with padding are not supported.

C/Fortran Known Issues and Limitations

  • OpenMP* offload is only supported for static libraries on Windows*. 
  • On Windows* (LAPACK (C only) and DFT domains), OpenMP* offload does not support Intel® oneAPI Level Zero backend and works only with the OpenCL* backend. To run OpenMP* offload, set the environment variable LIBOMPTARGET_PLUGIN=opencl.  
  • On Linux* (DFT domain), OpenMP* offload does not support oneAPI Level Zero backend and works only with the OpenCL* backend. To run OpenMP* offload, set the environment variable LIBOMPTARGET_PLUGIN=opencl. 
  • The custom dynamic builder tool does not support building custom dynamic libraries for OpenMP* offload C and Fortran interfaces. 
  • Intel® Fortran Compiler Classic (ifx) does not support the specific SYCL linking option (-fsycl-device-code-split), which may result in long execution times for first calls of SYCL-based functions. Also, functionality on DG1 may be affected – see the note about enabling double precision emulation below.
  • LU factorization (dgetrf) for OpenMP* offload may fail with an invalid argument error when used with an OpenCL* backend and in double precision. As a workaround, use the oneAPI Level Zero backend. 
  • Note that Intel® Fortran Compiler Classic 2021.1 (ifx) remains in beta release and does not yet support the full language. As such, some oneMKL Fortran examples may not compile with Intel® Fortran Compiler Classic 2021.1 (ifx).  
  • On Windows* 10 version 2004, the fmod function causes the floating point stack to be left in an incorrect state when called with the x parameter equal to zero. When certain oneMKL functions such as zgetrs are called at some point after the problematic fmod function call, the results returned may be incorrect (in particular, they may be NaNs).  If possible, avoid using this version of Windows* 10 or later until a fix is provided by Microsoft*.  Alternatively, after calling fmod, the floating point stack may be cleared by calling the emms instruction.
  • Vector Math and service domain headers for Fortran (mkl_vml.f90, mkl_service.fi) may produce compile errors when compiled with GNU Fortran 10.10. As a short-term solution, add -fallow-invalid-boz to the compilation line.
  • Iterative sparse solvers (ISS RCI) changed the behavior of the _init and _check functionality: calls to the latter are now optional, and, if called, they correct parameter inconsistencies.
  • In Sparse BLAS, there are three stages for the usage model: the create/inspection stage, the execution stage and the destruction stage. For Sparse BLAS with C OpenMP* offload, only the execution stage can be asynchronously done, provided any data dependencies are already respected. Known limitation: user must remove the "nowait" clause from the mkl_sparse_?_create_csr and mkl_sparse_destroy calls and add a "#pragma omp taskwait" before the call to mkl_sparse_destroy in the Sparse BLAS C OpenMP* offload async examples to make them safe.
  • For DFT Fortran OpenMP* offload, only rank-1 input arrays are supported. Multidimensional input data must be represented using a rank-1 array.
  • For DFT OpenMP* offload to GPU, user-defined strides with padding are not supported.
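
For the DFT Fortran OpenMP* offload limitation above, multidimensional data has to be addressed through a rank-1 array. Assuming the data is stored contiguously in the usual Fortran (column-major) order, the mapping is the standard linearization in which the first index varies fastest. The sketch below shows it in C with 0-based indices; the helper name `flat_index_3d` is hypothetical and not part of oneMKL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a oneMKL function: offset of element (i, j, k)
 * of an n1 x n2 x n3 array stored contiguously in column-major (Fortran)
 * order, using 0-based indices. The first index varies fastest. */
size_t flat_index_3d(size_t i, size_t j, size_t k,
                     size_t n1, size_t n2, size_t n3)
{
    assert(i < n1 && j < n2 && k < n3);
    return i + n1 * (j + n2 * k);
}
```

With Fortran's 1-based indices, element x(i, j, k) of an n1 x n2 x n3 array corresponds to element i + n1*((j-1) + n2*(k-1)) of the equivalent rank-1 array.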

Intel® Iris® Xe MAX Graphics known issues and limitations

Unsupported Functionality:
  • Double precision functionality is not supported on this platform. 
  • In addition, the following single precision Vector Math functions are less accurate than their CPU counterparts for very large or very small arguments, including denormals: atan2pi, atanpi, cdfnorm, cdfnorminv, cosd, cosh, erfc, erfcinv, erfinv, expm1, frac, hypot, invcbrt, invsqrt, ln, powx, sind, sinh, and tand. 
  • Several Vector Math functions do not have HA accuracy versions, and have only low accuracy (LA) and extended precision (EP) versions: atan2pi, atanpi, cos, cosd, cospi, log2, log10, pow, powx, sin, sincos, sind, sinpi, tan.

Other Issues:

  • On Windows*, use of the OpenCL* backend is required for some BLAS and LAPACK functionality.

 

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.