Intel® oneAPI Math Kernel Library (oneMKL) Release Notes

Published: 07/03/2019  

Last Updated: 08/23/2022

By Khang T Nguyen, Abhinav Singh

 

Where to Find the Release

Intel® oneAPI Math Kernel Library

New in This Release

NOTE for NuGet Package Manager Users: The oneMKL NuGet packages for the 2021.4 release will be delayed. We are working to get the package size to fit within NuGet size limits. Because of this, oneMKL packages for 2021.4 will not be available at the oneAPI 2021.4 release; we hope to have them uploaded soon. Please check back for information on these packages. If you do not use the NuGet package manager, you are not affected.

KNOWN ISSUE

  • Intel oneAPI Math Kernel Library 2022.1.0 does not include the latest functional and security updates. Intel oneAPI Math Kernel Library 2022.1.1 is targeted for release in May 2022 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.
  • After installing the oneAPI Base Toolkit 2022.1, compiling applications with Win32 platform settings that require oneAPI Math Kernel Library (oneMKL) will fail. 32-bit oneMKL on Windows* OS is provided separately as part of the Intel® oneAPI Base Toolkit 32-bit package and can be downloaded as an add-on.

2022.2

System Requirements  Bug Fix Log

Features

  • BLAS

    • Improved argument and error checking and added exception support for DPC++ interfaces. 
    • Added device timing support on GPU with MKL_VERBOSE=2. 
    • Added mixed-precision and low-precision support for GEMM and GEMM_BATCH for both DPC++ and OpenMP Offload.
    • Added low-precision and complex support for select Level 1 BLAS APIs.
    • Added threading support and improved CPU performance for OMATADD.
    • Improved GEMV performance on Intel GPUs.
  • LAPACK

    • Extended C/Fortran OpenMP offload to support the OpenMP* 5.1 specification.
    • Enabled MKL_VERBOSE support for LAPACK GPU functionality for DPC++ and OpenMP offload.
    • Introduced GETRFNP_BATCH (group batched non-pivoting LU factorization) for DPC++ USM and C/Fortran APIs, and enabled C/Fortran OpenMP offload support. 
    • Introduced DPC++ interface for gels_batch strided functionality for single and double precision real cases supported on GPU, and C/Fortran OpenMP GPU offload API. Note that complex data types and the transposed case are not supported.
    • Introduced QR factorization routine for tall-skinny matrices (?getsqrhrt and related routines) from Netlib LAPACK 3.9.1; integrated bug fixes from Netlib LAPACK 3.10.1, including an out-of-bounds read in ?larrv that resolves CVE-2021-4048. oneMKL LAPACK functionality is now aligned with Netlib LAPACK 3.10.1. 
  • ScaLAPACK

    • Integrated bug fixes from Netlib ScaLAPACK 2.2.0. Intel® oneMKL ScaLAPACK functionality is now aligned with Netlib ScaLAPACK 2.2.0. 
  • Sparse

    • Introduced C OpenMP Offload API for mkl_sparse_order() for sorting sparse matrices on GPU.
    • Improved single threaded performance on CPU for MKL_SPARSE_?_MV.
    • Improved performance for DPC++ API SPARSE::MATMAT.
    • Improved performance for DPC++ API SPARSE::GEMM.
    • Improved performance for C OpenMP Offload APIs MKL_SPARSE_?_MV(), MKL_SPARSE_?_MM(), MKL_SPARSE_?_TRSV(), and MKL_SPARSE_SP2M() on GPU.
  • DFT

    • Introduced mixed complex/real type DPC++ APIs for out-of-place R2C/C2R FFT.
    • Improved commit time for 2D/3D real FFTs on GPU.
    • Fixed failures of 1D C2C/R2C/C2R FFT with non-unit stride.
  • Vector Math

    • Improved performance for double-precision complex multiplication, exponential and logarithm, and double-precision log1p on GPU.
    • Improved performance for single-precision inverse complementary error function and reciprocal cube root on GPU.
    • Re-enabled CPU fallback capability so that all Vector Math functions can be called transparently on all computational devices for OpenMP offload interface.  
  • Library Engineering

    • Removed mkl_link_line_advisor.htm file from product.
    • Extended oneMKL verbose output for GPU to report host and device time.
    • Fixed MKL_VERBOSE=2 to display both CPU and GPU time.
    • Added _64 LAPACKE and CBLAS APIs for ILP64 interface so that users can mix LP64 and ILP64 interfaces in one application.
    • The Link Tool now allows selecting OpenMP offload for 32-bit oneMKL.
    • Fixed oneMKL disregarding CPU affinity set by users when certain functions were called.
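The verbose device-timing items above are driven entirely by the MKL_VERBOSE environment variable, so no rebuild is needed. A minimal sketch follows; the application name is a placeholder, not part of the product:

```shell
# MKL_VERBOSE=1 traces oneMKL calls with host (CPU) timing;
# MKL_VERBOSE=2 additionally reports GPU device time (new in 2022.2).
export MKL_VERBOSE=2

# A hypothetical oneMKL-linked application would then print one trace
# line per oneMKL call to stdout, e.g.:
#   ./my_mkl_app
echo "MKL_VERBOSE=$MKL_VERBOSE"
```

Setting the variable per run (`MKL_VERBOSE=2 ./my_mkl_app`) works equally well and avoids carrying the setting into later runs.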

Known Issues and Limitations

  • GEMM_BATCH group APIs may crash on Intel® Arc™ Graphics. 
  • Extremely large SPARSE::MATMAT workloads may still result in hangs or segmentation faults.
  • Debug mode builds of the TRSV/SP2M C OpenMP offload APIs and of SPARSE::TRSV/SPARSE::MATMAT are non-functional on Win32e; use Release mode as a workaround.
  • SPARSE::TRMV() may throw an exception with 1-based indexing and a sycl::buffer container; as a workaround, use 0-based indexing or a USM container.
  • MKL_VERBOSE support on GPU is missing for the LAPACK functions GEQRF, GETRI_OOP_BATCH, and GETRS_BATCH.
  • The LAPACK DPC++ getrf (USM) function may work incorrectly on Intel® Iris® Xe Max Graphics (incorrect results or an exception).
  • The CMake config for oneMKL may fail to find the correct oneMKL cluster library for MPICH2 on Linux. As a possible workaround, use the Link Line Advisor to link oneMKL manually, or replace `MKL_MPI STREQUAL "mpich"` with `MKL_MPI MATCHES "mpich"` in $MKLROOT/lib/cmake/MKLConfig.cmake.
  • Vector Math functions for OpenMP Offload for C and Fortran cannot be used in "dynamic" linking mode.
  • int8 gemm and gemm-like routines may throw an exception on Intel® Gen9 or Gen12LP family GPUs for large matrices.
  • Mixed-precision GEMM calls might produce incorrect results on Intel® Iris® Xe Max Graphics when the m or n parameter is 1.
  • On 11th generation Intel Core mobile processors, GEMV performance might be lower depending on problem size.
  • The ScaLAPACK C examples fail to build with the Intel® oneAPI DPC++/C++ compiler (icx). As a workaround, the Intel® C/C++ Classic compiler (icc) may be used.
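The MKLConfig.cmake edit suggested for the MPICH2 issue above can be scripted. The sketch below demonstrates the substitution on a temporary stand-in file rather than the real $MKLROOT/lib/cmake/MKLConfig.cmake, so the installation is left untouched:

```shell
cfg=$(mktemp)
# Stand-in for the affected comparison inside MKLConfig.cmake
echo 'if(MKL_MPI STREQUAL "mpich")' > "$cfg"

# Loosen the exact-string test (STREQUAL) to a regex match (MATCHES)
# so MPI implementations named "mpich2" and similar are also recognized.
sed -i 's/MKL_MPI STREQUAL "mpich"/MKL_MPI MATCHES "mpich"/' "$cfg"

cat "$cfg"   # -> if(MKL_MPI MATCHES "mpich")
```

To edit the installed file, point the same sed command at $MKLROOT/lib/cmake/MKLConfig.cmake (and keep a backup copy first).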

2022.1

System Requirements  Bug Fix Log

Features

  • BLAS

    • Extended C/Fortran OpenMP offload to support the OpenMP* 5.1 specification 
    • Enabled MKL_VERBOSE support for BLAS GPU functionality for DPC++ and OpenMP offload
  • Intel® Distribution for LINPACK* Benchmark

    • Continued performance enhancements for 3rd Generation Intel Xeon Scalable processors and the processor code-named “Sapphire Rapids”
  • Transpose

    • Added new DPC++ API for omatadd_batch function
    • Enabled MKL_VERBOSE support for CPU for transpose domain
  • LAPACK

    • Improved performance for LU, batch strided LU solve and inverse on Intel GPU.
    • Improved performance for real precision divide-and-conquer SVD ({S,D}GESDD) on CPU.
    • Introduced multishift QZ algorithm from Netlib LAPACK 3.10.0; integrated other minor bug fixes from Netlib LAPACK 3.9.1–3.10.
    • Modified a set of ?LAQR[0-5] computational kernels used for solving non-symmetric eigenvalue problems (?GEEV, ?GEES, ?GEEVX and ?GEESX) as was done in Netlib LAPACK 3.10. 
  • Sparse

    • Improved performance for sparse::gemm with col-major on all GPUs.
    • Extended the support for C OpenMP offload for MKL_SPARSE_?_MM with column-major layout
  • DFT

    • Relaxed padding requirement for out-of-place complex-to-real FFT on GPU.
  • Vector Math

    • Introduced _FTZDAZ_DEFAULT to accurately represent the default VML mode for C interface.
    • Improved the DPC++ interface to have configurable CPU fallback. It is enabled by default to be compatible with previous versions.
    • Improved cbrt, erf performance for Intel discrete GPUs
    • Improved accuracy for several functions (SPOW3O2/HA, DPOWX/EP, VSCOSD/LA, VSCOSD/EP, SLGAMMAF/HA, CDIV/EP, ZDIV/EP)
  • Vector Statistics

    • Introduced Device DPC++ APIs for Bernoulli distribution and mcg31m1 / mcg59 engines
    • Optimized Device DPC++ implementation for Gaussian distribution with sycl::vec<16> and mrg32k3a engine
    • Optimized CPU implementations for exponential, lognormal, Cauchy, Weibull, Rayleigh, and Gumbel distributions for vector lengths of 1E5 and higher
  • Data Fitting

    • Introduced experimental DPC++ APIs with GPU support for linear / cubic Hermite splines, uniform / non-uniform partitions hints and construction / interpolation routines
  • Library Engineering

    • Introduced Single Dynamic Library linking mode support for applications using OpenMP offload for LAPACK, sparse BLAS (C only), Vector Statistics, DFTI, and FFTW APIs.
    • Half-precision dispatch is now enabled by default; the controlling setting is MKL_ENABLE_INSTRUCTIONS=AVX512_E4.
    • The threading layer is set to GNU OpenMP when the gomp library is already loaded, and to Intel threading by default.
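As a sketch of the new Single Dynamic Library mode noted above: instead of linking the individual interface/threading/core libraries, an application links only libmkl_rt and selects layers at run time. The compiler flags and file names below are illustrative assumptions, not a prescribed link line:

```sh
# Hypothetical link line: one library (mkl_rt) instead of the usual
# interface + threading + core trio; OpenMP offload flags for icx shown.
icx -fiopenmp -fopenmp-targets=spir64 app.c -lmkl_rt -o app

# Layer selection can then happen at run time, e.g.:
#   MKL_INTERFACE_LAYER=ILP64 MKL_THREADING_LAYER=GNU ./app
```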

Known Issues and Limitations

  • The sp2m offload example fails with asynchronous execution on the OpenCL backend. As a workaround, use the Level Zero backend (the default) or run in synchronous mode on the OpenCL backend.
  • Debug mode builds of the trsv/sp2m offload APIs and of sparse::trsv/sparse::matmat are non-functional on Win32e; use Release mode as a workaround.
  • Experimental Data Fitting DPC++ APIs are not supported using mkl_sycld.dll on Windows OS.
  • On GPUs lacking native double-precision support, non-batched and batched LAPACK functions {c,s}getr{f,s} via OpenMP offload with static linking may fail with an error stating that the double type is not supported on the platform. As a workaround, use dynamic linking or the -fsycl-device-code-split=per_kernel compilation flag.
  • Real-to-complex and complex-to-real FFT with non-unit stride might return incorrect results if strides are flipped when switching from forward to backward transform as recommended by the oneMKL documentation. As a workaround, do not flip the strides when switching from forward to backward transform.
  • Due to a DPC++ issue, when calling LAPACK DPC++ routines on the CPU device and using an in-order queue, application-side kernels may not wait for the LAPACK calculations to finish and may produce incorrect results. As a workaround, call wait() on the queue after the LAPACK call for explicit synchronization.
  • The lognormal<double> device random number distribution with the philox4x32x10 engine may produce wrong results on Gen9 GPUs on Windows OS when the /Od option is enabled.
  • Using multiple host threads with the Level Zero backend can cause segmentation faults, exceptions, or other unexpected behavior.
  • The beta distribution with the mt2203 generator may produce a wrong random sequence on Xe HPG with the C OpenMP offload API on the OpenCL backend on Linux. As a workaround, use the Level Zero backend.
  • The mrg32k3a generator may produce a wrong sequence on Xe HPG on Linux.
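The -fsycl-device-code-split=per_kernel workaround mentioned in the list above is a compile-time flag. A hypothetical build line (the source and output names are placeholders) might look like:

```sh
# Splitting device code per kernel prevents JIT compilation of unused
# double-precision kernels on GPUs without native fp64 support.
icpx -fsycl -fsycl-device-code-split=per_kernel -qmkl app.cpp -o app
```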

2022.0

System Requirements  Bug Fix Log

Features

  • BLAS

    • Extended DPC++ support for in-place and out-of-place matrix copy/transposition. 
      • oneapi::mkl::blas::{i,o}matcopy_batch 
  • LAPACK

    • Enabled C/C++ OpenMP* offload support for getri_oop_batch. 
    • Improved performance of double precision, non-pivoting batch strided LU factorization on GPU. 
    • Improved performance of out-of-place batch strided LU inverse on GPU. 
    • Renamed the LAPACK DPC++ function getrfnp_batch_strided to getrfnp_batch.
  • Sparse

    • Enabled C/C++ OpenMP* offload support for mkl_sparse_sp2m and mkl_sparse_?_export_csr.
    • Improved performance of DPC++ oneapi::mkl::sparse::matmat for small to medium sizes.
  • DFT

    • Enabled MKL_VERBOSE support on GPU devices for DFT DPC++ and C/C++/Fortran OpenMP* offload.
  • Vector Math

    • Improved performance and stability.
  • Library Engineering

    • Enabled support of lp64 & ilp64 BLAS and LAPACK interfaces in a single application.

Known Issues and Limitations

  • LAPACK functions {sy,he}{ev,evd,gvd,gvx,trd} and gesvd for single precision may work incorrectly on Intel® Iris® Xe MAX Graphics / Intel® UHD Graphics for Intel® Processor Graphics Gen11.
  • In certain cases, to avoid crashes, oneMKL may force synchronization within OpenMP* offload functionality even when nowait clause is provided. 
  • The DPC++ matmat examples can sporadically produce a segmentation fault due to an off-by-one memory allocation error in the example code when one-based indexing is used.
  • Random number generators uniform_with_host_helpers device example may fail on Gen9 and DG1 GPUs in case of Windows OS and enabled /Od option. 
  • Sparse BLAS C OpenMP* offload in asynchronous execution mode fails on the OpenCL backend. Use the Level Zero backend (the default) instead.

Deprecation

  • Dropped support for Intel® Xeon Phi™ Processor x200 “Knights Landing (KNL)” and Intel® Xeon Phi™ “Knights Mill (KNM)” processors. AVX2 code paths are still supported on these processors.
  • Dropped MPICH2 support on Windows*.  
  • Dropped SGI MPI support on Linux*. 
  • Deprecated support for Microsoft Visual Studio* 2017 with this release.
  • Removed cvf (stdcall) interface.
  • Renamed "cl::sycl::vector_class" and "sycl::vector_class" to "std::vector" for input events in DPC++ USM API.
  • Changed data type “half” to “sycl::half”.

Previous oneAPI Releases

2021

Release Notes  System Requirements  Bug Fix Log

 

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.