Intel® oneAPI Deep Neural Network Library Release Notes

ID 763685
Updated 3/28/2024
Version 2024.1 (latest)
Public

This document provides a summary of new and changed product features.

Where to Find the Release

Download the Intel® oneAPI Base Toolkit, which contains oneDNN, from the main portal of the Intel® oneAPI Base Toolkit, and follow the installation instructions to install it.

2024.1 (v3.4)

What's New

Performance Optimizations

  • Intel® Architecture Processors:

    • Improved performance for 4th generation Intel® Xeon® Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel® AVX2 or Intel® AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel® AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel® AVX-512 and Intel® AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel® AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast mask 1 and 14.
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax, and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel® Graphics Products:

    • Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel® Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel® Arc graphics (formerly Alchemist and DG2) and the Intel® Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product primitive performance with ACL.

Functionality

  • Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel® Graphics Products and supports matmul with int8 weight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel® Data Center GPU Max Series (formerly Ponte Vecchio).
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel® AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel® Graphics Products
    • Introduced support for Intel® Data Center GPU Max 1550VG.
    • Introduced PReLU post-op support for inner product and matmul primitives.
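
To make the weight-compression item above concrete, here is a minimal Python sketch of what a matmul with int8-compressed weights computes: the weights are stored as int8 values with scales and zero points and are expanded to floating point before the multiply. This is a conceptual illustration only, not oneDNN code. The function names, the grouping of rows along the K dimension, and the nested-list layout are assumptions made for the example.

```python
def decompress_weights(q_weights, scales, zero_points, group_size):
    """Expand int8 weights (K x N) to floats.

    scales and zero_points have shape (K // group_size) x N: one pair per
    group of `group_size` consecutive rows. The grouping is an assumption
    of this sketch.
    """
    rows = []
    for k, q_row in enumerate(q_weights):
        g = k // group_size  # group index for this row
        rows.append([
            (q - zero_points[g][n]) * scales[g][n]
            for n, q in enumerate(q_row)
        ])
    return rows


def matmul(a, b):
    """Plain (M x K) by (K x N) matrix product on nested lists."""
    return [
        [sum(a_row[k] * b[k][n] for k in range(len(b))) for n in range(len(b[0]))]
        for a_row in a
    ]


def matmul_compressed(src, q_weights, scales, zero_points, group_size):
    """Reference result: decompress the weights, then multiply."""
    return matmul(src, decompress_weights(q_weights, scales, zero_points, group_size))
```

An optimized implementation would typically fuse the decompression into the matmul kernel rather than materialize the floating-point weights; the sketch only pins down the arithmetic.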

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.
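
The deterministic mode and accumulation mode items above both concern floating-point accumulation. The Python sketch below is a conceptual illustration, not oneDNN code: it shows that summation order can change a result (one common reason results differ when work is split differently between runs) and that the precision of the accumulator matters. The helper names are hypothetical, and fp16 rounding is emulated with the standard `struct` module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_in_order(values):
    """Left-to-right sum in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_fp16_accumulator(values):
    """Left-to-right sum where the accumulator is rounded to fp16 each step."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


# Floating-point addition is not associative: same inputs, different order.
forward = sum_in_order([0.1, 0.2, 0.3])
backward = sum_in_order([0.3, 0.2, 0.1])

# A narrow accumulator stops growing once the increment falls below its spacing.
ones = [1.0] * 4096
wide = sum_in_order(ones)
narrow = sum_fp16_accumulator(ones)
```

Here `forward` and `backward` differ in the last bit, while repeating the same order always reproduces the same value. The fp16 accumulator stalls at 2048.0 because 2049 is not representable in half precision, whereas the wide accumulator reaches 4096.0.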

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Known Limitations

  • Intel® Data Center GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
  • int8 concat primitive may produce incorrect results on integrated GPUs with current GPU driver.
  • fp32 pooling primitive may produce incorrect results in rare conditions on Intel® Data Center GPU Max Series with current GPU driver.
  • reorder primitive causes segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
  • fp64 convolution and deconvolution produce incorrect results on integrated graphics in future Intel® Core processors (code name Arrow Lake).
  • int8 matmul primitive creation with fp32 bias fails on Intel® GPU Flex Series and Intel® Arc Graphics.

Breaking Changes

  • Updated minimal supported ACL version to 23.11 (was 23.02.1).

Third Party Programs File

Previous Releases

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel® Xeon® Scalable processors (formerly Sapphire Rapids).
    • Improved int8 convolution performance with zero points on processors with Intel® AMX instruction set support.
    • Improved performance for the future Intel® Xeon® Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
    • Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and/or Intel® AMX instruction set support.
    • Improved s32 binary primitive performance.
    • Improved fp16, fp32, and int8 convolution performance for processors with Intel® Advanced Vector Extensions 2 (Intel® AVX2) instructions support.
    • Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
    • Improved performance of convolution for depthwise cases with Graph API.
    • [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
  • Intel Graphics Products:
    • Improved performance for the Intel® Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel® Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced RNN primitive initialization time on Intel GPUs.
  • AArch64-based Processors:
    • Improved fp32 to bf16 reorder performance.
    • Improved max pooling performance with Arm Compute Library (ACL).
    • Improved dilated convolution performance for depthwise cases with ACL.

Functionality

  • Introduced group normalization primitive support. The functionality is currently available on CPUs.
  • Intel CPUs:
    • Introduced support for zero points in int8 convolution with groups and 3D spatial.

Usability

  • Extended verbose mode output:
    • Improved diagnostics on engine creation errors.
    • Added information on Graph API calls.
    • Added information on strides for non-dense memory objects.
    • Added values of runtime dimension.
    • Added indication that primitive descriptor was created with any memory format tag.
  • Introduced examples for Graph API.
  • Graph API constant tensor cache is now disabled by default and requires opt-in with dnnl::graph::set_constant_tensor_cache() call.
  • Reduced oneDNN Graph API memory consumption in certain scenarios.

Validation

  • Extended benchdnn performance reporting with primitive creation time.
  • Introduced cold cache mode in benchdnn.

Known Limitations

  • Current GPU OpenCL runtime for Linux has an issue resulting in convolution producing incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
  • Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel® Arc GPUs on Windows.
  • Current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on Intel® Data Center GPU Max Series.
  • Extensive use of RNN primitive on Intel GPUs with default primitive cache setting may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
  • Int8 deconvolution with signed weights and activations may produce incorrect results on processors with Intel® AMX support.

What's New

  • Deliver production quality AI Deep Learning optimizations for 4th Gen Intel® Xeon® Scalable processor, Intel® Xeon® processor Max Series, Intel® Data Center GPU Flex Series, and Intel® Arc™ A-Series GPUs
  • Support for S8/S8 weights and activations enables greater input influence on the outcomes on 4th Gen Intel® Xeon® Scalable processor with Intel® Advanced Matrix Extensions (Intel® AMX) acceleration instruction set
  • Support wider operators: BF32 on 4th Gen Intel® Xeon® Scalable processor, and TF32 on Intel® Data Center GPU Flex Series and Intel® Max Series GPUs, for more accurate inferencing
  • Enable limited support for FP64 operators on Intel® Data Center GPU Max Series GPUs for high precision model deployment
  • Deliver experimental Graph API support (open source only) to simplify integration to frameworks and extend optimization capabilities

Performance Optimizations

  • Intel® Architecture processors:
    • Improved performance for 4th generation Intel® Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced performance optimizations for bf16 floating point math mode on 4th generation Intel® Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel® AMX instructions in computations on fp32 data.
    • Introduced FP16 support and initial optimizations for future Intel® Xeon Scalable processor (code name Granite Rapids).
  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Introduced performance optimizations for tf32 floating point math mode on future Xe Architecture graphics (code name Ponte Vecchio). The tf32 math mode allows oneDNN to use tf32 arithmetic in computations on fp32 data.
    • Improved performance for Intel® Arc graphics (formerly Alchemist and DG2) and Intel® Data Center GPU Flex Series (formerly Arctic Sound-M).
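
The bf16 math mode described above trades precision for speed by performing arithmetic on bf16 values even though the data is stored as fp32. As a rough illustration of the precision involved, the Python sketch below rounds fp32 values to bfloat16, which keeps the top 16 bits of the fp32 encoding. It is not oneDNN code; the function names are hypothetical, NaN and infinity handling is omitted, and the accumulator is an ordinary Python float.

```python
import struct


def round_to_bf16(x):
    """Round a float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the top 16 bits of the IEEE fp32 encoding. NaN and
    infinity handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bf16_bits = ((bits + bias) >> 16) & 0xFFFF
    return struct.unpack("<f", struct.pack("<I", bf16_bits << 16))[0]


def dot_bf16_inputs(a, b):
    """Dot product with inputs rounded to bf16 and a wider accumulator."""
    return sum(round_to_bf16(x) * round_to_bf16(y) for x, y in zip(a, b))
```

Near 1.0 the spacing between adjacent bf16 values is 2^-7, so a value such as 1.00390625 (1 + 2^-8) rounds back to 1.0. Whether that loss is acceptable depends on the model, which is presumably why the mode is something an application opts into.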

Functionality

  • Introduced runtime output scales support in all primitives.
  • Introduced scales support in concat primitive.
  • Extended floating point math mode API with tf32 data type option.
  • Extended eltwise primitive with support for hardsigmoid algorithm.
  • Extended layer normalization primitive with support for mixed source and destination data types.
  • Extended depthwise post-op with support for arbitrary padding size. The implementation is available only on Intel processors.
  • Added limited fp64 data type support in convolution primitive. Optimized implementation is available for future Xe Architecture graphics (code name Ponte Vecchio).
  • Extended int8 convolution and deconvolution implementations on GPUs with arbitrary destination data type support.
  • Extended batch normalization primitive with dnnl_fuse_norm_add_relu flag that allows fusing sum and relu operations. The implementation is available for Intel GPUs.
  • Extended GPU deconvolution primitive implementation with support for output scales and zero points.
  • Introduced new quantization scheme. Major changes include support for per-argument runtime scales in all primitives and unquantized bias.
  • Introduced support for Intel DPC++/C++ Compiler 2023.0, including new features from the SYCL 2020 standard.
  • Extended persistent cache to cover GPU engine object. This improvement allows applications to further reduce oneDNN initialization time.
  • Extended threadpool API with a function to indicate maximum available concurrency.
  • Extended binary primitive implementation on GPU with bfloat16 source and int8 destination support.

Usability

  • Added matmul_perf example that benchmarks matmul primitive for all supported data types.
  • Introduced annotations for JIT kernels to allow profilers like Linux perf to correctly label JIT code.
  • Extended verbose logs converter with RNN primitive support.
  • Added verbose output for dnnl_*gemm* calls.
  • Removed Level Zero headers from the list of build time dependencies.
  • Extended the set of supported format tags to cover formats used in applications.

Deprecated Functionality

  • Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.
  • Static output scales are deprecated and will be removed in the next release.
  • Convolution Winograd algorithm implementation for int8 data type is deprecated and will be removed in the next release.

Breaking Changes

  • Changed formula for AUGRU RNN cell to align with TensorFlow. See proposal for details.
  • Removed deprecated APIs.
  • Removed operation descriptor object and made memory descriptor object opaque. See details in operation and memory descriptors RFC.
  • Removed creation time primitive scales support and primitive output scales support. See details in quantization scaling RFC.
  • Removed support for Intel DPC++/C++ Compiler with SYCL 1.2.1 (aka SYCL 2017) standard.
  • Removed Winograd convolution implementation for int8 data type.

No change from 2022.1 version to 2022.2 version.

Performance Optimizations

  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
    • Improved performance for future Arc graphics (code name Alchemist and DG2).
  • Intel® Architecture processors
    • Improved performance for future Intel® Xeon Scalable processors (code name Sapphire Rapids). The functionality is now enabled by default and requires Linux kernel 5.16 or later.
    • Improved performance of matmul primitive for processors with Intel® AVX-512 support.

New Functionality

  • Introduced bfloat16 destination support for int8 convolution, matmul and inner product primitives for processors with Intel® AVX-512 support and/or future Intel® Xeon® Scalable processors (code name Sapphire Rapids).
  • Extended RNN primitive with support for AUGRU cell.
  • Added support for non-zero negative slope in ReLU post-op for batch normalization primitive.
  • Introduced support for mixed source and destination data types in softmax primitive.
  • Introduced persistent cache API. This functionality allows serializing and reusing JIT kernels.

Usability

  • Reduced stack consumption in GEMM implementation.

Breaking Changes

  • Removed performance optimizations for Intel® Xeon® Phi processors. oneDNN will continue to be functional on these processors using Intel® AVX2 codepath.

Deprecated Functionality

  • Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.

Known issues and limitations

Performance Optimizations

  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Introduced initial optimizations for future Xe Architecture graphics (code name Ponte Vecchio).
    • Improved pooling and layer normalization primitives performance.
  • Intel® Architecture processors
    • Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is now enabled by default and requires Linux kernel 5.16.
    • Improved performance of matmul primitive for processors with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) support.

New Functionality

  • Introduced support for compiler with SYCL 2020 standard support.
  • Introduced support for the ICX/ICPX and DPCPP compiler drivers available in the Intel® oneAPI DPC++ Compiler.

Usability

  • Added environment variables and build options with 'ONEDNN' prefix.

Breaking Changes

  • The Intel MKL-DNN compatibility API is removed. See Transition from Intel® MKL-DNN to oneDNN page for instructions on moving to the new API.

Deprecated Functionality

  • Support for Intel® Xeon Phi processors is deprecated and will be removed in the next release.
  • Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.

Known issues and limitations

Performance Optimizations

  • Improved primitive cache performance for Intel Graphics products.
  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Introduced initial optimizations for future Intel® Arc™ Graphics codenamed Alchemist (ACM). That includes optimizations of compute-bound primitives (Convolution, GEMM) for s8/u8, f16 and bf16 datatypes via DPAS (Dot Product Systolic Accumulate) instructions.
    • Improved performance of convolution and deconvolution after some OpenCL kernels were re-implemented using kernel code generator (jit:ir implementation as reported by DNNL_VERBOSE).
  • Intel® Architecture processors
    • Improved performance for future Intel® Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
    • Improved binary primitive performance for cases when one of the tensors is broadcasted.
    • Improved reorder primitive performance for memory formats with padding and/or zero points.
    • Improved performance of reduction, reorder, and shuffle primitives.
    • Improved performance of depthwise forward convolution primitive for processors with Intel® AVX-512 support.
    • Improved performance of forward inner product primitive for the shapes with minibatch equal to 1 for processors with Intel® AVX-512 support.
    • Improved int8 GEMM performance for processors with Intel® AVX2 and Intel® DL Boost support.

New Functionality

  • Introduced PReLU post-op support in convolution and matmul.
  • Extended maximum allowed post-ops chain for compute primitives (convolution, deconvolution, inner product, and matmul) to 32.
  • Introduced support for zero points in sum post-op for convolution and matmul. The functionality is implemented only for CPUs.
  • Extended binary primitive with support for mixed data types for input tensors. The functionality is implemented only for CPUs.
  • Extended sum post-op for convolution and matmul primitives with support for mixed data types. The functionality is implemented only for CPUs.

Usability

  • Reduced overall library size by trimming down use of templates, OpenCL headers, and TBB headers. The configuration that benefited the most is the CPU-only configuration with TBB threading.

Deprecated Functionality

  • Intel MKL-DNN compatibility API is deprecated and will be removed in the next update. See Transition from Intel MKL-DNN to oneDNN page for instructions on moving to new API.
  • Support for Intel Xeon Phi processors is deprecated and will be removed in the next release.

Known issues and limitations

Performance Optimizations

  • Extended primitive cache to improve primitive descriptor creation performance.
  • Improved primitive cache performance in multithreaded configurations.
  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Introduced initial optimizations for bfloat16 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
    • Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
    • Improved performance of reduction primitive.
    • Improved performance of depthwise convolution primitive with NHWC activations for training cases.
  • Intel® Architecture processors
    • Introduced initial optimizations for bfloat16 functionality for future Intel® Xeon Scalable processor with Intel® AMX support (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
    • Improved performance of int8 compute functionality for future Intel® Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control. 
    • Introduced initial performance optimizations for future Intel® Core processor with Intel® AVX2 and Intel® DL Boost instructions support (code name Alder Lake).
    • Improved performance of int8 primitives for processors with Intel® SSE4.1 instruction set support.
    • Improved performance of int8 and bfloat16 RNN and inner product primitives.
    • Introduced CPU ISA hints environment variable and API. New API is intended to dispatch function implementations using YMM registers to improve performance on processors with a single Intel® AVX512 compute unit.
    • Improved forward convolution performance for Intel® AVX-512 systems.
    • Improved convolution and batch normalization performance with threadpool.
    • Improved performance of bfloat16 shuffle primitive.
    • Improved performance of `dnnl_gemm` and functionality relying on this implementation for cases with `n=1` on all supported processors.
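
CPU dispatcher control and ISA hints, mentioned in several items above, can be set through environment variables before the application starts. The Python sketch below builds such an environment for a child process. It assumes the variable names `DNNL_MAX_CPU_ISA` and `DNNL_CPU_ISA_HINTS` and the values `AVX512_CORE_AMX` and `PREFER_YMM` as given in the oneDNN documentation; confirm them against the documentation for your version. The helper `onednn_env` is hypothetical.

```python
import os


def onednn_env(max_cpu_isa=None, isa_hints=None, base=None):
    """Return a copy of the environment with oneDNN CPU controls set.

    Variable names and values are assumptions taken from the oneDNN
    documentation; they are not validated here.
    """
    env = dict(os.environ if base is None else base)
    if max_cpu_isa is not None:
        env["DNNL_MAX_CPU_ISA"] = max_cpu_isa    # CPU dispatcher control
    if isa_hints is not None:
        env["DNNL_CPU_ISA_HINTS"] = isa_hints    # e.g. prefer YMM registers
    return env


env = onednn_env(max_cpu_isa="AVX512_CORE_AMX", isa_hints="PREFER_YMM", base={})
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching an application linked against oneDNN. Setting the variables in the shell before starting the application has the same effect.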
       

New Functionality

Usability

  • Introduced support for DPC++ debug configuration on Windows.

Breaking changes

  • Updated minimal supported CMake version to 2.8.12 (was 2.8.11).

Known issues and limitations

  • Backward inner product primitive may produce incorrect results for shapes where the number of output channels is not a multiple of 16 on future Intel Xeon Scalable processor (code name Sapphire Rapids).
  • Convolution with binary post-op may produce incorrect results for formats with channel padding.
  • Pooling and batch normalization primitives may hang on Windows GEN9 and DG1 in DPC++/L0 configuration.
  • Pooling and batch normalization primitives with 4D double blocked memory formats may produce NaNs or hang on Linux DG1 platforms.
  • See DPC++ limitations that impact the library as well.

Performance Optimizations

  • Reduced overheads associated with primitive cache.
  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Improved performance of int8 primitives with NHWC activations format.
    • Improved functionality performance for padded memory formats.
    • Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
    • Improved performance of fp16 pooling primitive.
    • Improved performance of lnorm primitive for plain memory formats.
    • Improved performance of resampling primitive for blocked memory formats.
    • Improved performance of Winograd convolution.
  • Intel® Architecture processors
    • Introduced initial optimizations for bfloat16 functionality for future Intel® Xeon Scalable processor with Intel® AMX support (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
    • Improved performance of int8 compute functionality for future Intel® Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control. 
    • Introduced initial performance optimizations for future Intel® Core processor with Intel® AVX2 and Intel® DL Boost instructions support (code name Alder Lake).
    • Improved performance of int8 primitives for processors with Intel® SSE4.1 instruction set support.
    • Improved performance of int8 and bfloat16 RNN and inner product primitives.
    • Introduced CPU ISA hints environment variable and API. New API is intended to dispatch function implementations using YMM registers to improve performance on processors with a single Intel® AVX512 compute unit.
    • Improved forward convolution performance for Intel® AVX-512 systems.
    • Improved convolution and batch normalization performance with threadpool.
    • Improved performance of bfloat16 shuffle primitive.
    • Improved performance of `dnnl_gemm` and functionality relying on this implementation for cases with `n=1` on all supported processors.
       

New Functionality

  • Introduced binary post-op for (de)-convolution, pooling, eltwise, binary, inner product, matmul and reduction (GPU only) along with performance optimizations for CPUs and GPUs. Extended the number of supported post-ops for primitives to 20.
  • Extended eltwise primitive with support for `logsigmoid`, `mish`, `hardswish`, and `clip_v2` algorithms.
  • Introduced support for PReLU primitive.
  • Introduced int8 support for LSTM primitive with projection for CPU.
  • Introduced asymmetric quantization support for int8 deconvolution.
  • Extended matmul implementation with support for per-output channel zero-points for quantization.
  • Extended support for broadcasting in binary primitive to both inputs for CPU.
  • Extended binary primitive with support for comparison operators.
  • Introduced float16 support in reduction primitive for GPU.
  • Introduced support for mixed input and output types in binary primitive for GPU.
  • Introduced support for post-ops in GPU resampling implementation.

Usability

  • Added API to enable displaying timestamps in oneDNN verbose mode. Timestamps allow oneDNN verbose output to be used in profiling tools.
  • Improved presentation of oneDNN primitives in Intel® VTune™ Profiler.

Validation

  • Extended benchdnn to report operation bandwidth.
  • Added ability to choose target GPU in benchdnn.

Known issues and limitations

  • When using driver version older than 27.20.100.9316 for Intel® UHD Graphics for 9th Gen Intel® Processor on Windows, convolution/de-convolution functions may sporadically hang or produce incorrect results in DPC++ configuration with LevelZero. Please upgrade your driver version to fix the issue. An alternative solution is to use DPC++ with OpenCL backend with DPC++ compiler.
  • Reorder, prelu, softmax, and pooling primitives on GPUs may be slower for zero padded memory formats than Intel oneDNN 2021.1.
  • Reorder operation for 5D tensor with two dimensions equal to 16 and one uneven dimension can produce incorrect results on Intel® Iris® Xe Max Graphics.
  • Eltwise primitive may produce incorrect results for oneDNN DPC++ configuration with Level Zero runtime. In order to avoid this, use DPC++ with OpenCL backend with DPC++ compiler.
  • Deconvolution primitive may segfault with int8 data for cases with non-trivial padding on processors with Intel AVX-512 support.
  • Deconvolution primitive may segfault with int8 data when used with post-ops and per_oc broadcast on processors with Intel AVX2 support.
  • Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check for GPU devices being non-Intel. To have more control, users can create a DNNL engine passing SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (it depends on OS and system settings), you may face a situation resulting in apparent hang of the application. There are ways to configure driver or system settings to disable this timeout and avoid hanging of DPC++ or OpenCL programs, including oneDNN examples:
  • See DPC++ limitations that impact the library as well.

New Functionality

Known issues and limitations

  • Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check for GPU devices being non-Intel. To have more control, users can create a DNNL engine passing SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (it depends on OS and system settings), you may face a situation resulting in apparent hang of the application. There are ways to configure driver or system settings to disable this timeout and avoid hanging of DPC++ or OpenCL programs, including oneDNN examples:
  • See DPC++ limitations that impact the library as well.

 

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.