June 7, 2022 | Intel® VTune™ Profiler
Find and optimize performance bottlenecks fast across CPU, GPU, and FPGA systems.
- Supports DirectML API to pinpoint host-side API call inefficiencies and their causes
- Enables developers to identify memory-transfer bottlenecks in GPU computing tasks that use the USM extension of the OpenCL™ API by analyzing CPU-side stacks.
Learn more at software.intel.com/vtune
June 1, 2022 | oneAPI Specification
Intel is further advancing its support of the oneAPI ecosystem through an agreement to acquire Codeplay Software, a global leader in cross-architecture, open, standards-based developer technologies.
Codeplay is globally recognized for its expertise and leadership in SYCL, the Khronos Group’s open-standard programming model used in oneAPI, and its significant contributions to the industry ranging from open-ecosystem activities like SYCL and OpenCL™ to RISC-V, automotive software safety, and medical imaging.
Codeplay has extensively delivered products supporting diverse hardware platforms globally, embracing the mission of bringing oneAPI to the masses.
Bolstered by the strength of Intel, Codeplay will be able to extend the delivery of SYCL solutions into cross-architecture and multi-vendor products, based on open standards and the open source ecosystems upon which they are built.
May 31, 2022 | Intel @ ISC 2022
At International SuperComputing 2022, Jeff McVeigh, VP of Super Compute Group, highlighted Intel’s HPC leadership technologies that are being used to accelerate innovation for a more sustainable and open HPC-AI, including how:
- Intel software and oneAPI extend across the software stack to provide tools, platforms, and software IP that help developers produce scalable, better-performing, more efficient code that takes advantage of the latest silicon innovations without the burden of refactoring.
- Two new Intel oneAPI Centers of Excellence join the ecosystem, bringing the total to 22 universities and labs working across the globe to increase oneAPI capabilities and adoption.
Introducing the New Intel oneAPI Centers of Excellence
- University of Bristol is developing best practices for achieving performance portability at exascale using oneAPI and the Khronos Group* SYCL abstraction layer for cross-platform programming. The goal: ensure scientific codes can achieve high performance on massive heterogeneous supercomputing systems.
- Centre for Development of Advanced Computing (CDAC) is building a base of skilled instructors who deliver oneAPI training to India HPC and AI communities. CDAC will scale training broadly in the country through its infrastructure and teach oneAPI in top universities.
More to Discover
Heidelberg University has recently enabled ROCm support for random number generation and BLAS in Intel® oneAPI Math Kernel Library (oneMKL) interfaces. This is a new and significant community contribution to the oneMKL interfaces project, part of the oneAPI industry initiative that provides SYCL-based APIs for math algorithms focused on CPUs and compute-accelerator architectures.
This work—adding into the project support for rocRAND and rocBLAS—now makes it possible to generate random numbers and perform linear algebra computations using the hipSYCL compiler to achieve near-native performance in cross-platform applications written in hipSYCL. Additionally, it makes oneMKL open-source interfaces the first oneAPI component with upstream support for other SYCL implementations apart from DPC++.
- Learn more about Heidelberg University’s hipSYCL work on the oneAPI specification by reading this blog.
- Learn about Heidelberg University’s engineering vision with the oneAPI project here.
- Understand more about other key contributors to the oneAPI CoE ecosystem here.
- Learn more about the oneAPI initiative at oneapi.io.
- Start developing with the oneMKL open-source interfaces here.
May 25, 2022 | Intel® oneAPI Deep Neural Network Library
In the latest release of TensorFlow 2.9, performance improvements are delivered by Intel® oneAPI Deep Neural Network Library (oneDNN) enabled by Google as the default backend CPU optimization for x86 packages. This applies to all Linux x86 packages and for CPUs with neural-network-focused hardware features like AVX512_VNNI, AVX512_BF16, and AMX vector and matrix extensions found on 2nd gen Intel® Xeon® Scalable processors and newer CPUs.
These optimizations accelerate key performance-intensive operations such as convolution, matrix multiplication, and batch normalization, with up to 3 times performance improvements compared to versions without oneDNN acceleration.
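TensorFlow documents an environment variable, TF_ENABLE_ONEDNN_OPTS, for switching the oneDNN optimizations on or off. The following is a minimal sketch of how a developer might toggle it to compare runs, assuming a TensorFlow 2.9 Linux x86 package where the optimizations are on by default. The helper name configure_onednn is hypothetical, and the variable is set before TensorFlow is imported so that it is in place when the library loads.

```python
import os


def configure_onednn(enabled: bool) -> str:
    """Set TF_ENABLE_ONEDNN_OPTS before TensorFlow is imported.

    configure_onednn is a hypothetical helper name used for illustration;
    only the environment variable itself comes from TensorFlow.
    """
    value = "1" if enabled else "0"
    os.environ["TF_ENABLE_ONEDNN_OPTS"] = value
    return value


# Disable the optimizations, for example to compare timings or numerical
# results against a run that leaves them enabled.
setting = configure_onednn(enabled=False)

try:
    # Import only after the variable has been set.
    import tensorflow as tf
    print("TensorFlow", tf.__version__, "with TF_ENABLE_ONEDNN_OPTS =", setting)
except ImportError:
    print("TensorFlow is not installed; TF_ENABLE_ONEDNN_OPTS =", setting)
```

Running the same workload once with each setting is a simple way to see how much of a speedup the default optimizations provide on a given CPU; results will vary by model and hardware.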
Why It’s Important
While there is an emphasis today on AI accelerators like GPUs for machine learning and deep learning, CPUs remain a primary player in all stages of the AI workflow—ubiquitous across most personal devices, workstations, and data centers. These default optimizations will help enable millions of developers who already use TensorFlow to achieve productivity gains, faster time to train, and efficient utilization of compute.
Performance gains will benefit applications spanning natural language processing, image and object recognition, autonomous vehicles, fraud detection, medical diagnosis and treatment, and more.
Get the Software
- Download oneDNN standalone or as part of the Intel® oneAPI Base Toolkit.
- Download Intel® Optimization for TensorFlow standalone or as part of the Intel® oneAPI AI Analytics Toolkit.
May 15, 2022 | Data Parallel C++/SYCL
Intel recently released an open-source tool to migrate code to SYCL through a project called SYCLomatic; it helps developers more easily port CUDA code to SYCL and C++ to accelerate cross-architecture programming for heterogeneous architectures. This open-source project enables community collaboration to advance adoption of the SYCL standard, a key step in freeing developers from a single-vendor proprietary ecosystem.
How the SYCLomatic Tool Works
SYCLomatic assists developers in porting CUDA code to SYCL, typically migrating 90-95% of CUDA code automatically to SYCL code. To finish the process, developers complete the rest of the coding manually and then custom tune to the desired level of performance.
According to James Reinders, Intel oneAPI evangelist, “Migrating to C++ with SYCL gives code stronger ISO C++ alignment, multivendor support to relieve vendor lock-in, and support for multiarchitecture to provide flexibility in harnessing the full power of new hardware innovations. SYCLomatic offers a valuable tool to automate much of the work, allowing developers to focus more on custom tuning than porting.”
SYCLomatic is a GitHub project. Developers are encouraged to use the tool and provide feedback and contributions to advance the tool’s evolution.
The latest Intel® oneAPI Tools are now available for direct download and/or use in the Intel® DevCloud. This release includes updates to all Toolkits (including 30+ individual tools)—each optimized to deliver improved performance and expanded capabilities for data-centric workloads.
Intel® Arc™ (Discrete) GPUs for Media, Gaming, and AI workloads
- Use cross-architecture Intel® oneAPI software tools to create immersive end-user experiences across technologies, platform capabilities, software, and AI-accelerated processing on the GPU combined with the CPU.
- Delivers up to 50x performance improvement over video-software encode with the industry’s first hardware-accelerated AV1 codec, enabled by Intel® oneAPI Video Processing Library (oneVPL). [Benchmark reference below]
- Includes deep learning support via the oneAPI-powered Intel® Distribution of OpenVINO™ toolkit and Intel® oneAPI Deep Neural Networks Library (oneDNN) as well as performance-tuning insights with Intel® VTune™ Profiler.
- Intel® oneAPI DPC++/C++ Compiler adds more SYCL* 2020 features to improve developer productivity for programming various hardware accelerators such as GPUs and FPGAs, enhances OpenMP* 5.1 compliance, and improves performance of OpenMP reductions for compute offload.
- Intel® Fortran Compiler, based on modern LLVM technology, adds support for parameterized derived types, F2018 IEEE Compare, and VAX structures, and expands OpenMP 5.0 support with Declare Mapper for scalars.
- oneMKL adds MKL_VERBOSE GPU support for the BLAS Domain and CPU support for the transpose domain for improved visibility during debugging.
- oneCCL now supports Intel® Instrumentation and Tracing Technology profiling, opening new insights with tools such as VTune Profiler.
- oneTBB improves support and use of the latest C++ standard for parallel_sort, plus adds fully functional features for task_arena extension, collaborative_call_once, adaptive mutexes, heterogeneous overloads for concurrent_hash_map, and task_scheduler_handle.
- oneVPL supports multiple hardware adapters and expanded development environments, plus MPEG2 decode in a CPU implementation to improve codec coverage for systems that do not have dedicated hardware.
- Intel® MPI Library enables better resource planning and control at an application level with GPU pinning, plus adds multi-rail support to improve application internode communication bandwidth.
- Intel® Advisor adds user recommendations and sharing, including optimizing data-transfer reuse costs of CPU-to-GPU offloading, details of GPU Roofline kernels and Offload Modeling, and seeing offloaded parts of the code at source level (including performance metrics) in a GPU Roofline perspective.
- Intel® VTune™ Profiler opens the ability to identify performance inefficiencies related to Intel® VT-d for latest-generation server platforms, supports Intel Arc GPUs, and is available as a Docker container.
AI Workload Acceleration
- Intel® Extension for TensorFlow* adds faster model loading, improvements in efficient element-wise Eigen operations, and support for additional fusions such as matmul biasadd-g.
- Additional functionality and productivity for Intel® Extension for Scikit-learn* and Intel® Distribution of Modin* through new features, algorithms and performance improvements such as Minkowski and Chebyshev distances in kNN and acceleration of the t-SNE algorithm.
- Acceleration for AI deployments with quantization and accuracy controls in the Intel® Neural Compressor, making great use of low-precision inferencing across supported Deep Learning Frameworks.
- Support of new PyTorch model inference and training workloads via Model Zoo for Intel® Architecture, extending support to include Python 3.9, TensorFlow v2.8.0, PyTorch v1.10.0, and IPEX v1.10.0.
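For readers unfamiliar with the Minkowski and Chebyshev distances mentioned in the list above, here is a minimal pure-Python sketch of both metrics and of how they plug into a k-nearest-neighbors vote. It is an illustration only, not the Intel® Extension for Scikit-learn* implementation; the function names and toy data are invented for this example. In scikit-learn itself, the metric is chosen through the `metric` (and `p`) parameters of `KNeighborsClassifier`.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance of order p: (sum |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a, b):
    """Chebyshev distance: max |a_i - b_i| (the limit of Minkowski as p grows)."""
    return max(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, labels, query, k, distance):
    """Majority vote among the k training points closest to query."""
    ranked = sorted(range(len(train)), key=lambda i: distance(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]
query = (4.0, 4.5)

print(minkowski((0, 0), (3, 4), p=2))   # 5.0 (p=2 is the Euclidean case)
print(chebyshev((0, 0), (3, 4)))        # 4
print(knn_predict(train, labels, query, k=3, distance=chebyshev))
print(knn_predict(train, labels, query, k=3,
                  distance=lambda a, b: minkowski(a, b, p=1)))
```

Passing the distance as a function keeps the voting logic independent of the metric, which mirrors how the metric is a configurable parameter in kNN libraries.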
Scientific Visualization with Rendering & Ray Tracing
- Intel® Open Volume Kernel Library adds support for IndexToObject affine transform and constant cell data for Structured Volumes.
- Intel® OSPRay and Intel® OSPRay Studio now include support for Multi-segment Deformation Motion Blur for mesh geometry, plus new light features and optimizations.
- Intel® Implicit SPMD Program Compiler Run Time (ISPCRT) library is included in the package.
- Intel® FPGA Add-On for oneAPI Base Toolkit enables users to specify an exact, min, or max latency between read and write access on memories and pipes and provides the ability to implement arithmetic floating point operations involving a constant with either DSPs and ALMs or only ALMs.
GROMACS, accelerated by SYCL, oneAPI, and multiarchitecture tools, has strong performance on GPUs based on Intel Xe Architecture
The recent GROMACS 2022 release was extended to multi-vendor architectures, including current and upcoming GPUs based on Intel Xe Architecture.
The team, led by Erik Lindahl from Stockholm University & Royal Institute of Technology, ported GROMACS’ CUDA code, which runs only on Nvidia hardware, to SYCL using the Intel® DPC++ Compatibility Tool, which typically automates migration of 90%-95% of the code.1,2 The result: a single, portable, cross-architecture-ready code base that significantly streamlines development and provides flexibility for deployment in multiarchitecture environments.
The software’s accelerated compute was made possible by using Intel oneAPI cross-architecture tools—oneAPI DPC++/C++ Compiler, oneAPI libraries, and HPC analysis and cluster tools.
With GROMACS 2022’s full support of SYCL and oneAPI, we extended GROMACS to run on new classes of hardware. We’re already running production simulations on current Intel Xe architecture-based GPUs as well as the upcoming Intel Xe architecture-based GPU development platform Ponte Vecchio via the Intel® DevCloud. Performance results at this stage are impressive – a testament to the power of Intel hardware and software working together. Overall, these optimizations enable diversity in hardware, provide high-end performance, and drive competition and innovation so that we can do science faster, and lower costs downstream.
GROMACS is a molecular dynamics package designed for simulations of proteins, lipids, and nucleic acids. Its simulations contribute to the identification of crucial pharmaceutical solutions for conditions such as breast cancer, COVID-19, and Type 2 diabetes, and to the international distributed-computing initiative Folding@home.
1The team ported GROMACS’ Nvidia CUDA code to Data Parallel C++ (DPC++), which is a SYCL implementation for oneAPI, in order to create new cross-architecture-ready code.
2Intel estimates as of September 2021. Based on measurements on a set of 70 HPC benchmarks and samples, with examples like Rodinia, SHOC, PENNANT. Results may vary.
If you’re a content creator or game developer, new Intel® Evo™ laptops equipped with Intel Arc A-Series GPUs empower you to create immersive end-user experiences with innovation across technologies, software, and AI-accelerated processing.
And Intel® software tools are a big part of helping developers liberate Intel Arc graphics capabilities and optimize applications for maximum visual performance on the GPU combined with Intel CPUs. Using them, you can:
- Analyze and optimize graphics bottlenecks. Use Intel® Graphics Performance Analyzers to profile graphics and game applications and ramp up profiling abilities with ray tracing, system-level profiling, and Xe Super Sampling (XeSS) capabilities. Capture streams and traces, optimize shaders, and identify the most expensive events with support for multiple APIs (DX, Vulkan, OpenGL, OpenCL, etc.).
- Accelerate compute-intensive tasks. Identify the most time-consuming parts of CPU and GPU code. Visualize thread behaviors to quickly find and fix concurrency problems using Intel® VTune™ Profiler.
- Speed up media processing and cloud game streaming. Intel® oneAPI Video Processing Library (oneVPL) enables hardware AV1 encode and decode support as well as Intel® Deep Link via Hyper Encode APIs, delivering up to 1.4x faster single-stream transcoding when taking advantage of multiple Intel accelerators in a platform.1 For content creators already using HandBrake and DaVinci Resolve, oneVPL is integrated into the latest versions.
- Integrate AI and machine learning. For game developers, the Intel® Game Dev AI Toolkit delivers a spectrum of AI-powered capabilities, from immersive world creation to real-time game-object-style transfer visualizations.
1. Up to 40% higher FPS in video encoding through an internal release of HandBrake on integrated Intel Xe graphics + discrete Intel Arc graphics compared to using Intel Arc graphics alone. HandBrake running on Alchemist pre-production silicon. As of October 2021.
Soda Announces Intel oneAPI Center of Excellence to Support Scikit-learn Performance across Architectures
March 31, 2022 | Intel® Extension for Scikit-learn*
The Social Data research team (Soda) at Inria, France’s national research institute for digital science and technology, is establishing an Intel oneAPI Center of Excellence to focus on developing hardware-optimized performance boosters for scikit-learn, one of the most widely used machine learning libraries.
This scikit-learn extension will deliver more efficient machine learning by using oneAPI numba_dppy or DPC++ components. Additionally, the implementation will be packaged in an independently-managed project possibly maintained by scikit-learn core developers, Intel engineers, and other interested community members.
Heterogeneous computing is inevitable. It happens when a host schedules computational tasks to different processors and accelerators like CPUs and GPUs. This partnership will make scikit-learn more performant and energy-efficient on multi-architecture systems.
The Social Data research team specializes in computational and statistical research in data science and machine learning—including scikit-learn optimizations—to harness large databases focused on health and social sciences.
March 10, 2022 | Intel® oneAPI DPC++/C++ Compiler
Now there are more ways to download compilers that support multiple parallelism models. LLVM-based DPC++/C++/C compilers for Windows* can now be downloaded from the Visual Studio Marketplace. These compilers:
- Include extensions that support productive development of fast, multicore, vectorized, and cluster-based applications.
- Support the latest C/C++ language and OpenMP* standards.
- Support multiple parallelism models and high-performance libraries including oneTBB, oneMKL, oneVPL, and Intel® IPP.
- Can be used to build mixed-language applications with C++, Visual Basic, C#, and more.
At Intel’s 2022 Investor Meeting, product updates included next-generation Intel® Xeon® and client CPUs and Ponte Vecchio/Arctic Sound-M GPUs that will accelerate data center, AI, and other segment workloads, along with the software to make this all happen.
Intel’s Software-First strategy was noted in Executive Breakout sessions.
- Greg Lavender, Sr. Vice President, CTO, and GM of Intel Software and Advanced Technology Group, discussed in an editorial and presentation how open, standards-based, cross-architecture programming through oneAPI and Intel® oneAPI Toolkits delivers performance and development productivity across advanced architectures.
- Raja Koduri, Sr. Vice President and GM of Intel Accelerated Computing Systems & Graphics Group, outlined the combined power of hardware and software fronting Intel’s Media and HPC-AI Super Compute Strategies. Highlights:
- Intel® Xeon® processors and an open ecosystem, including oneAPI Video Processing Library, Intel® oneAPI AI Analytics Library, and OpenVINO™ toolkit, deliver high-density, real-time broadcast and premium content to meet global demands where 80% of Internet traffic is video.1
- Upcoming Arctic Sound-M GPU will deliver a seamless media supercomputer with leadership transcode performance that addresses quality, latency, and density requirements for desktop and cloud gaming, with an AI analytics engine. It will be the industry’s only open-source media solution stack for streaming, gaming, and analytics, and the industry’s first GPU with AV1 encode that delivers over 30% bandwidth improvement at the same quality.2
- Billions of lines of code are optimized for Xeon, which powers 85% of supercomputers.3 This sets a strong, seamless ecosystem foundation for the fierce combo of Intel Xeon Sapphire Rapids + Ponte Vecchio GPU, where oneAPI unleashes developers to utilize a range of CPUs and accelerators using a single codebase.
- Intel Technology Roadmaps & Milestones
- Intel’s Software Advantage, Decoded
- Software at Intel: Open & Designed with Security in Mind
- Raja Koduri’s Accelerated Computing & Graphics presentation
- oneAPI | Intel® oneAPI Toolkits
1Source: Cisco Global 2021 Forecast Highlights
2Source: Mhojhos Research
3Based on TOP500 list over the past decade
February 14, 2022 | Intel® oneAPI Tools
The Technical University of Darmstadt (TU Darmstadt) Embedded Systems and Applications Group has announced the establishment of an Intel oneAPI Center of Excellence (CoE). The center’s objective is to accelerate data-parallel computing and simulation software used in medical and pharmaceutical research, powered by oneAPI open cross-architecture programming.
Together with Intel, the university will port an accelerated version of the AutoDock application to create a single code base that can be efficiently optimized and tuned for multiple hardware architecture targets.
Additionally, TU Darmstadt is working on a next-gen parallel implementation of AutoDock-GPU, which aims to speed up drug-discovery simulations by parallel execution across CPUs, GPUs, and FPGAs.
“The new oneAPI Center of Excellence is an exciting step forward for the multiarchitecture SYCL language and oneAPI,” says Joe Curley, vice president and general manager of Intel Software Products and Ecosystem division. “This collaboration with the TU Darmstadt team provides a path for medical and pharmaceutical researchers to use AutoDock-GPU productively on the hardware of their choice.”