Accelerated Development of High-performance GenAI Applications

Leveraging the Latest Updates of Intel® AI Tools and Framework Optimizations

The latest 2025.0 and 2025.1 releases of Intel’s AI Tools and Frameworks (powered by the oneAPI programming model) introduce several new features and optimizations that help AI developers, especially those building Generative AI (GenAI) applications, accelerate software development and supercharge their workloads. The AI Tools and Frameworks and the Intel® oneAPI Toolkits 2025.1 launched on Monday, March 31.

In a recent webinar, ‘Enhance GenAI Productivity, Acceleration, and Scalability with AI Tools,’ Rob Mueller-Albrecht from Intel discussed what’s new in the latest AI tools and how our AI software development resources deliver a seamless developer experience along with significant performance and code-quality improvements on heterogeneous architectures, including CPUs, GPUs, and AI PCs from diverse vendors.

 

The webinar covered the following topics: 
 

  • How we embrace open source AI frameworks 

  • Our latest contributions to the PyTorch* community 

  • How Intel® VTune™ Profiler helps analyze and debug performance issues in AI workloads

Intel® Developer Tools 2025.1 release for faster AI and HPC is now available! 

Check out the complete webinar: Enhance GenAI Productivity, Acceleration and Scalability with AI Tools.

This blog will provide highlights of the webinar and show how to get the most out of our latest AI tools and frameworks.  

 

Encouraging Open AI Development 

We are actively contributing our optimization know-how and expertise to industry-standard open-source AI frameworks’ communities, including PyTorch, Hugging Face*, Python*, and TensorFlow*. This approach allows developers to streamline their existing AI workflows and easily migrate them to Intel’s latest accelerated hardware without losing compatibility with cross-vendor platforms or abandoning legacy codebases. Our AI tools and libraries support popular LLMs such as Meta* Llama, Microsoft* Phi-3, and Hugging Face Qwen, facilitating easy integration and reuse in your applications. 

Our goal in embracing the open software ecosystem is to provide a highly productive development stack for high-performance AI and open-accelerated computing, enabling streamlined adoption with code samples and developer resources.  
 

  • Our key PyTorch optimizations have been upstreamed to PyTorch versions 2.5 and 2.6, including support for Intel® CPUs and GPUs. 

  • Our TensorFlow optimizations enhance the stock framework for better performance and scalability through the Intel® oneAPI Deep Neural Network Library (oneDNN).

Our latest oneAPI and AI Tools support FP16, BF16, and INT8 data types for AI workloads on server CPUs, leading to significant performance gains for Llama inference and for compute-intensive math operations with the Intel® oneAPI Math Kernel Library (oneMKL).

The AI Tools Selector allows users to create and download custom bundles based on their use cases, leveraging Intel CPUs/GPUs, the OpenVINO™ toolkit, and Intel® Gaudi® AI Accelerators, providing a one-stop shop for AI development.

 

Enhancing PyTorch Capabilities: Contributions to the PyTorch Ecosystem 

The 2025.1 release of our oneAPI and AI Tools integrates the Intel® Extension for DeepSpeed*, adding SYCL* kernel support for Intel GPU devices to Microsoft's open-source optimization library DeepSpeed.

The binary distributions of the Intel® Extension for PyTorch are available as open repositories, covering the latest Intel® Core™ and Intel® Xeon® processors and Intel® Data Center GPUs. SYCL-based GPU offload support has been upstreamed to the PyTorch community repository for versions 2.4 and 2.5. Users can switch their PyTorch implementation to benefit from Intel GPUs by changing the tensor device name from cuda to xpu. 
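For example, here is a minimal sketch of that device switch, assuming a PyTorch build with Intel GPU (xpu) support is installed; the model and tensor shapes are placeholders:

```python
import torch
import torch.nn as nn

# Use the Intel GPU ("xpu") device when available, otherwise fall back to the CPU.
device = torch.device("xpu" if torch.xpu.is_available() else "cpu")

model = nn.Linear(128, 64).to(device)        # any model moves the same way
x = torch.randn(32, 128, device=device)      # tensors created directly on the device
y = model(x)
print(y.device)
```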

In addition to support for a wide range of data types and improved training and inference, PyTorch optimizations include: 
 

  • torch.compile and eager execution modes with Hugging Face support, 

  • GenAI LLMs optimized for INT4 and transformer engines supporting FP8, and 

  • Post-training dynamic quantization on CPUs using TorchInductor (sketched below)
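As an illustration of the last item, here is a hedged sketch of post-training dynamic quantization lowered through TorchInductor, following PyTorch's PT2 Export (PT2E) x86 quantization flow; exact APIs may vary slightly between PyTorch releases, and the toy model is only a placeholder:

```python
import torch
import torch.nn as nn
from torch.export import export_for_training
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

# Placeholder FP32 model and example input.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
example_inputs = (torch.randn(8, 128),)

# Capture the graph, attach the x86 Inductor quantizer configured for dynamic
# quantization, then convert and compile so TorchInductor generates optimized kernels.
exported = export_for_training(model, example_inputs).module()
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_dynamic=True))
prepared = prepare_pt2e(exported, quantizer)
quantized = convert_pt2e(prepared)

with torch.no_grad():
    optimized = torch.compile(quantized)
    print(optimized(*example_inputs).shape)
```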

Watch the webinar recording from [00:08:30] to learn in detail about the PyTorch updates and community contributions. 

PyTorch v2.6 added support for the float16 data type for neural network inference and training on x86 CPUs, accelerated on the recently launched Intel® Xeon® 6 processors with P-cores through Intel® Advanced Matrix Extensions (Intel® AMX). PyTorch GPU optimizations now cover the full range of Intel® Arc™ Graphics, with simplified installation of the Intel GPU software stack and one-click installation of torch-xpu target support.
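A minimal sketch of what float16 CPU inference can look like, assuming PyTorch 2.6 or newer where autocast on the CPU backend accepts torch.float16; the model is a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
x = torch.randn(16, 512)

# CPU autocast runs the linear layers in float16; on Xeon 6 P-cores these
# operations can be accelerated by Intel AMX.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.float16):
    y = model(x)
print(y.dtype)  # torch.float16
```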

Most importantly, x86 CPU support has been added to FlexAttention for LLMs through TorchInductor CPP backend optimizations.

 

What is FlexAttention?

FlexAttention, introduced in PyTorch 2.5, supports various attention mechanisms and their combinations by leveraging torch.compile to generate a fused FlashAttention kernel, eliminating extra memory allocation and achieving performance comparable to handwritten implementations. It is commonly used in LLM projects such as Hugging Face Transformers and vLLM for better performance on x86 CPU platforms.

Attention mechanisms assign weights to model inputs so the model focuses on the most relevant parts of the data. FlexAttention provides pre-optimized attention variants that can be reused and combined with minimal PyTorch code; torch.compile then fuses them into a single FlexAttention kernel with holistic optimizations, delivering performance comparable to hand-written code.
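A minimal sketch of the FlexAttention API, assuming PyTorch 2.5 or newer; the tensor shapes and the causal score_mod are illustrative:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Toy (batch, heads, sequence, head_dim) tensors.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# A score_mod callback customizes the attention score for each (query, key) position;
# this one implements a simple causal mask.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# torch.compile fuses the chosen variant into a single FlashAttention-style kernel.
compiled_attention = torch.compile(flex_attention)
out = compiled_attention(q, k, v, score_mod=causal)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```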

 

From [00:13:35] in the webinar recording, the speaker gives an in-depth explanation of FlexAttention and its benefits. From [00:15:40], he illustrates the performance improvements achieved with FlexAttention on typical Llama models.

Intel’s PyTorch support includes developing hardware and execution flow optimizations, optimizing model deployment with OpenVINO, and upstreaming to the open-source PyTorch community. Key optimizations involve: 
 

  • Accelerating training and inference with oneDNN, 

  • Supporting distributed multi-node training with Intel® oneAPI Collective Communications Library (oneCCL), 

  • Enhancing parallel execution with Intel® oneAPI DPC++ Library (oneDPL) and high-performance math with oneMKL, and 

  • Utilizing CPU-integrated hardware acceleration extensions like Intel® Deep Learning Boost, Intel® AVX-512, and Intel® AMX. 

These optimizations ensure parallel operations, mixed data type precision, faster image-based deep learning, and efficient runtime behavior. The OpenVINO Toolkit further enhances PyTorch models by compressing model size, increasing inference speed, and enabling deployment across various Intel hardware platforms, ensuring compatibility and optimization integration with existing stacks.  

Watch the webinar recording from [00:17:20] for more details on our upstreaming and development model for PyTorch. 

Intel's GPU support upstreamed into PyTorch provides eager and graph mode support in the PyTorch front end. Eager mode now includes SYCL implementations of commonly used ATen operators, with additional operators added in PyTorch 2.6. Performance-critical graphs and operators are optimized using oneDNN and oneMKL. Graph mode (torch.compile) enables an Intel GPU backend for optimization and OpenAI* Triton integration. Essential components of Intel GPU support were added in PyTorch 2.4, including ATen operators, oneDNN, Triton, Intel GPU source builds, and tool chain integration.
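A brief sketch of graph mode on an Intel GPU, assuming a PyTorch 2.5+/2.6 build with xpu support and the Triton backend for Intel GPUs installed; the model is a placeholder:

```python
import torch
import torch.nn as nn

device = "xpu" if torch.xpu.is_available() else "cpu"
model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 10)).to(device).eval()

# torch.compile routes the graph through Inductor, which emits Triton kernels
# for the Intel GPU backend (or optimized C++ when falling back to the CPU).
compiled = torch.compile(model)
x = torch.randn(64, 256, device=device)
with torch.inference_mode():
    y = compiled(x)
print(y.shape, y.device)
```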

The architecture roadmap for upstreamed PyTorch support is discussed in the webinar recording beginning at [00:21:40]. 

Our PyTorch optimizations are supported by efforts to enhance Python performance. The Intel® Distribution for Python and Data Parallel Extension for Python enable high-performance machine learning and scientific applications through optimized heterogeneous computing. Benefits include scalability, near-native performance, productivity tools, and essential Python bindings. 
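For instance, here is a hedged sketch using the Data Parallel Extension for NumPy (dpnp), which exposes a NumPy-compatible API that offloads computation to SYCL devices; it assumes dpnp and a supported device are installed:

```python
import dpnp as np  # NumPy-compatible, device-offloaded arrays

x = np.arange(1_000_000, dtype=np.float32)
y = np.sin(x) + np.cos(x)       # computed on the default SYCL device
print(y.sum(), y.device)        # dpnp arrays report the device they live on
```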

New Python features introduced with our latest AI tools release include: 
 

  • New sorting and summing functions, 

  • NumPy* 2.0 support and compatibility with NumPy 2.2.3, 

  • Asynchronous execution of offloaded operations, and 

  • 90% functional compatibility with CuPy*. 

 

Improved Training and Inference with Intel® Neural Compressor and AI Performance Libraries 

The Intel Neural Compressor is an open-source Python library designed to enhance inference efficiency by optimizing model size and speed for deployment on CPUs, GPUs, and Intel Gaudi AI accelerators. It automates model optimization techniques such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks, providing state-of-the-art low-bit LLM quantization and sparsity exploitation. The latest release introduces a Transformers-like quantization API for weight-only quantization of LLMs and INT4 quantization for visual language models (VLMs), and it expands support for Intel® Arc™ B-Series Graphics GPUs and Intel Gaudi AI Accelerators. 
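As a hedged illustration of that Transformers-like flow, the sketch below follows the weight-only quantization pattern described in the Intel Neural Compressor documentation; the class names, configuration options, and model ID shown here are assumptions for illustration and may differ by release:

```python
# Assumption: neural-compressor (with PyTorch support) and transformers are installed.
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"       # placeholder model for illustration
quant_config = RtnConfig(bits=4)     # 4-bit round-to-nearest weight-only quantization

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)

inputs = tokenizer("The oneAPI programming model", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```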

oneDNN adds further optimizations and performance tuning for the latest Intel Xeon processors and a wider range of Intel Arc Graphics. Enhancements include improved matrix multiplication and convolution performance, targeting data center AI workloads with the Intel AMX instruction set. AI inference on client CPUs sees improved performance with oneDNN on Intel Arc Graphics, Intel Core Ultra processors, and Intel Arc B-Series discrete graphics. Optimizations for Gated Multi-Layer Perceptron (Gated MLP) and Scaled Dot-Product Attention (SDPA) with an implicit causal mask, plus support for INT8- and INT4-compressed key and value tensors through the Graph API, enhance inference speed and efficiency. 

oneCCL optimizes and scales inference and training on large datasets, supporting high-performance distributed computing based on the MPI and libfabric standards. It balances compute and communication performance, enabling efficient implementations of the collectives used in neural network training. The AI Tools 2025.1 release introduces additional control over collective communications with extensions to the Group API and optimizations for all-to-all operations, enhancing scalability and performance across common network topologies. 
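In PyTorch, oneCCL is typically used through the oneCCL Bindings for PyTorch, which register a ccl backend with torch.distributed. A minimal sketch, assuming the oneccl_bindings_for_pytorch package is installed and the ranks are launched with mpirun or torchrun (so RANK, WORLD_SIZE, and MASTER_ADDR are set in the environment):

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)

dist.init_process_group(backend="ccl")
rank, world = dist.get_rank(), dist.get_world_size()

# Each rank contributes a tensor; oneCCL performs the all-reduce collective.
t = torch.ones(4) * (rank + 1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}/{world}: {t.tolist()}")

dist.destroy_process_group()
```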

See the webinar replay from [00:28:25] to learn more about newly added features in Intel Neural Compressor and oneAPI-powered performance libraries. 

 

Experiment with the AI Playground and the AI Tools Containers 

Intel offers AI Containers and the AI Playground for starting new AI projects or porting existing ones. The AI Playground is an open-source project hosted on GitHub that provides tools for AI image creation, image stylizing, and chatbot applications on Intel® Arc™ GPUs, using libraries and models from GitHub and Hugging Face. It includes a wide range of GenAI libraries and models and a ready-to-use installer executable. 

The AI Containers repository offers templates and containers for scaling AI workloads optimized for Intel platforms using TensorFlow and PyTorch. Scaling is achieved with Python, Docker, Kubernetes, Kubeflow, cnvrg.io, Helm, and other orchestration frameworks for cloud and on-premise environments. Containers are stored on Microsoft* Azure but can be used with open-source registries. 

Watch the webinar recording from [00:36:00] to learn more about the AI Playground and the AI Tools Containers, as well as the steps to set up and deploy the containers. 

 

Profile Performance on AI PC NPUs with Intel® VTune™ Profiler 

AI PC NPUs, or Neural Processing Units, are specialized processors optimized to accelerate AI workloads such as natural language processing (NLP), image recognition, and other neural network-based tasks. They are ideal for low-power AI tasks over extended periods, such as webcam background blur and image segmentation.  

Intel Core Ultra processors include integrated NPUs that you can leverage to accelerate your workloads on AI PCs. The NPU Exploration Analysis feature of VTune Profiler helps you enhance your application’s performance by identifying bottlenecks such as NPU bandwidth utilization and the most time-consuming tasks, ensuring the NPU’s compute power is used as efficiently as possible. 

Watch the webinar recording from [00:40:25], where the presenter demonstrates how you can profile the performance of an Intel AI PC project using VTune Profiler. 

 

What’s Next? 

Stay up to date with our oneAPI and AI software tools updates

Check out the full webinar recording and jump-start AI development with our AI tools, libraries, and frameworks. You can also experiment with our oneAPI and AI tools on Intel’s latest accelerated hardware on the Intel® Tiber™ AI Cloud platform. 

 

Useful Resources