The vast proliferation and adoption of AI over the past decade has started to shift AI compute demand from training to inference. There is a growing push to put the many novel AI models created in recent years to use across diverse environments, from the edge to the cloud.
AI Inference refers to the process of using a trained neural network model to make a prediction. AI training, on the other hand, refers to creating that model or machine learning algorithm from a training dataset. Inference and training, along with data engineering, are the key stages of a typical AI workflow. The workloads associated with these stages are diverse, and no single processor, whether a Central Processing Unit (CPU), Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), or Artificial Intelligence (AI) accelerator, works best for your entire pipeline.
Let us delve deeper into AI Inference and its applications, the role of software optimization, and how CPUs and particularly Intel® CPUs with built-in AI acceleration deliver optimal AI Inference performance, while looking at a few interesting use case examples.
Not only has my work in AI involved applications in a number of meaningful fields ranging from healthcare to social good, but I have also been able to apply AI to one of my biggest passions — art. I really enjoy combining my hobbies such as painting and embroidery with AI. An example of this is where I was able to use the Neural Style Transfer technique to blend my artwork into the style of famous painters, photos of my friends and pets, or even an Intel microprocessor. We just might have an engaging, hands-on Neural Style Transfer demo for you at the end of the article. Let’s get started!
AI Inference as part of the end-to-end AI workflow
AI, at its essence, converts raw data into information and actionable insights through three stages — data engineering, AI training, and AI inference/deployment. Intel provides a heterogeneous portfolio of AI-optimized hardware combined with a comprehensive suite of AI tools and framework optimizations to accelerate every stage of the end-to-end AI workflow.
Figure 1. Inference as part of the End-to-End AI Workflow
Given how much attention has traditionally been paid to training in model-centric AI, and the more recent focus on data engineering and data-centric AI, inference can seem like an afterthought. However, applying what is learned during the training phase to deliver answers to new problems, whether in the cloud or at the edge, is where the value of AI is derived.
Edge inferencing continues to grow rapidly across smart surveillance, autonomous machines, and various real-time IoT applications, whereas cloud inferencing is already widely used for fraud detection, personalized recommendations, demand forecasting, and other applications that are less time-critical and may need more data processing.
Challenges with AI Inference deployment
Deploying a trained model for inference can seem trivial. This is far from true, however: the trained model is not used directly for inference but is modified, optimized, and simplified based on where it is being deployed. The optimizations depend on performance and efficiency requirements along with compute, memory, and latency constraints.
The diversity of data and the scale of AI models continue to grow with the proliferation of AI applications across domains and use cases, including vision, speech, recommender systems, and time series applications. Trained models today can be large and complex, with hundreds of layers and billions or even trillions of parameters. The inference use case, however, might require that the model still have low latency (e.g., automotive applications) or run in a power-constrained environment (e.g., battery-operated robots). This necessitates simplifying the trained models, even at a slight cost to prediction accuracy.
Two popular methods for optimizing a trained model without significant accuracy loss are pruning and quantization. Pruning refers to eliminating the least significant model weights, those that contribute minimally to the final results across a wide array of inputs. Quantization, on the other hand, involves reducing the numerical precision of the weights, for example from 32-bit float to 8-bit integer.
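To make the two ideas concrete, here is a minimal sketch in plain Python. It assumes unstructured magnitude pruning and symmetric per-tensor INT8 quantization, which are only one of several possible variants of each technique; the function names and the weight values are made up for illustration, and production tools work on whole tensors with calibration data.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    # Indices of the k weights with the smallest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]


# Illustrative weights only, not taken from a real model
weights = [0.82, -0.03, 0.41, 0.005, -1.27, 0.10, -0.56, 0.02]

pruned = prune_by_magnitude(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
```

In this toy case, half of the weights become zero and the remaining ones are stored as one byte each plus a single shared scale, while the round-trip error stays within half a quantization step.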
Accelerating AI Inference: Hardware
The different stages of the AI workflow typically have different memory, compute, and latency requirements. Data engineering has the highest memory requirements, so that large datasets can fit fully into system memory for efficient preprocessing, considerably shortening the time required to sort, filter, label, and transform your data.
Training is usually the most computationally intensive stage of the workflow and typically requires several hours or more to complete, depending on the size of the dataset.
Inference, on the other hand, has the most stringent latency requirement, often requiring results in milliseconds or less. A point of note here is that while the computing intensity of inference is much lower than that of training, inference is often done on a much larger dataset, so the total computing resources used for inference can exceed those used for training.
From hardware that excels at training large, unstructured data sets, to low-power silicon for optimized on-device inference, Intel AI supports cloud service providers, enterprises, and research teams with a portfolio of versatile, purpose-built, customizable, and application-specific AI hardware that turns artificial intelligence into reality.
The role of CPUs in AI
The Intel® Xeon® Scalable processor, with its unparalleled general purpose programmability, is the most widely used server platform from cloud to the edge for AI. CPUs are extensively used in the data engineering and inference stages while training uses a more diverse mix of GPUs and AI accelerators in addition to CPUs. GPUs have their place in the AI toolbox, and Intel is developing a GPU family based on our Xe architecture.
CPUs, however, remain optimal for most ML inference needs, and we are also leading the industry in driving technology innovation to accelerate inference performance on the industry’s most widely used CPUs. We continue expanding the built-in acceleration capabilities of Intel® DL Boost in Intel® Xeon® Scalable processors. Based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® DL Boost Vector Neural Network Instructions (VNNI) deliver a significant performance improvement by combining three instructions into one, thereby maximizing the use of compute resources, utilizing the cache better, and avoiding potential bandwidth bottlenecks.
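The "three instructions into one" idea can be modeled in a few lines of plain Python. The sketch below is a simplified model of a single 32-bit lane of an 8-bit dot product, not real intrinsics: it assumes four unsigned 8-bit activations multiplied by four signed 8-bit weights, and the function names are hypothetical.

```python
def sat16(x):
    """Clamp a value to the signed 16-bit range."""
    return max(-32768, min(32767, x))


def three_step_dot(acc, a_u8, b_s8):
    """Model of the older path: three separate operations per lane."""
    # Step 1: multiply byte pairs and add neighbours into saturated 16-bit sums
    pairs16 = [sat16(a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1])
               for i in (0, 2)]
    # Step 2: widen and add the two 16-bit partial sums into a 32-bit value
    sum32 = pairs16[0] + pairs16[1]
    # Step 3: add the result to the running 32-bit accumulator
    return acc + sum32


def fused_dot(acc, a_u8, b_s8):
    """Model of the fused path: multiply four byte pairs and accumulate at once."""
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))


activations = [10, 20, 30, 40]   # unsigned 8-bit values
weights = [1, -2, 3, -4]         # signed 8-bit values

old_result = three_step_dot(5, activations, weights)
new_result = fused_dot(5, activations, weights)
```

Both paths return the same accumulator value for typical inputs. In this model the fused form also skips the intermediate 16-bit saturation, so it differs from the three-step form only when a pair of products overflows 16 bits.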
Most recently, we announced Intel® AMX (Intel® Advanced Matrix Extensions), an extensible accelerator architecture in the upcoming Sapphire Rapids CPUs, which enables higher machine learning compute performance for both training and inference by providing a matrix math overlay for the AVX-512 vector math units.
Accelerating AI Inference: Software
Intel complements the AI acceleration capabilities built into our hardware architectures with optimized versions of popular AI frameworks and a rich suite of libraries and tools for end-to-end AI development, including for inference.
All major AI frameworks for deep learning (such as TensorFlow, PyTorch, MXNet, and PaddlePaddle) and classical machine learning (such as Scikit-learn and XGBoost) have been optimized by using oneAPI libraries that provide optimal performance across Intel® CPUs and XPUs. oneAPI is a standards-based, unified programming model that delivers a common developer experience across diverse hardware architectures.
These Intel software optimizations, referred to as Software AI Accelerators, help deliver orders of magnitude performance gains over stock implementations of the same frameworks. As a framework user, you can reap all performance and productivity benefits through drop-in acceleration without the need to learn new APIs or low-level foundational libraries. Along with developing Intel-optimized distributions for leading AI frameworks, Intel also upstreams our optimizations into the main versions of these frameworks, helping deliver maximum performance and productivity to your inference applications even when using default versions of these frameworks.
Deep neural networks (DNNs) show state-of-the-art accuracy for a wide range of computation tasks but still face challenges during inference deployment due to their high computational complexity. One way to alleviate this is low-precision optimization. With hardware acceleration support, low-precision inference can compute more operations per second, reduce memory access pressure, and better utilize the cache to deliver higher throughput and lower latency.
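A quick back-of-the-envelope calculation shows where these gains come from. The sketch below assumes a 512-bit vector register (as in Intel® AVX-512) and a hypothetical model with 100 million parameters; the helper names are illustrative.

```python
def values_per_vector(vector_bits, value_bits):
    """How many values of a given width fit in one vector register."""
    return vector_bits // value_bits


def weight_footprint_bytes(num_parameters, value_bits):
    """Memory needed to store the weights at a given precision."""
    return num_parameters * value_bits // 8


fp32_lanes = values_per_vector(512, 32)
int8_lanes = values_per_vector(512, 8)

num_parameters = 100_000_000  # hypothetical model size, for illustration
fp32_bytes = weight_footprint_bytes(num_parameters, 32)
int8_bytes = weight_footprint_bytes(num_parameters, 8)

print(f"values per 512-bit vector: FP32={fp32_lanes}, INT8={int8_lanes}")
print(f"weight storage: FP32={fp32_bytes / 1e6:.0f} MB, INT8={int8_bytes / 1e6:.0f} MB")
```

Four times as many 8-bit values fit in each register and each cache line, and the weights take a quarter of the memory, which is the source of the throughput and cache benefits described above.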
Intel Neural Compressor
The Intel® Neural Compressor tool aims to help practitioners easily and quickly deploy low-precision inference solutions on many of the popular deep learning frameworks, including TensorFlow, PyTorch, MXNet, and ONNX Runtime. Unified APIs are provided for neural network compression technologies such as low-precision quantization, sparsity, pruning, and knowledge distillation. It implements the unified low-precision inference APIs with mixed precision, easy extensibility, and automatic accuracy-driven tuning while being optimized for performance, model size, and memory footprint.
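The idea behind accuracy-driven tuning can be sketched as a simple search loop. The code below is a conceptual illustration only and is not the Intel® Neural Compressor API: the `tune` function, the configuration names, and the accuracy numbers are all hypothetical placeholders standing in for real quantization recipes and real evaluation runs.

```python
def tune(candidates, evaluate, baseline_accuracy, max_relative_loss=0.01):
    """Return the first candidate whose accuracy stays within tolerance.

    candidates: configurations ordered from most to least aggressive.
    evaluate:   callable that maps a configuration to a measured accuracy.
    """
    threshold = baseline_accuracy * (1.0 - max_relative_loss)
    for config in candidates:
        accuracy = evaluate(config)
        if accuracy >= threshold:
            return config, accuracy
    # No candidate met the accuracy goal: fall back to the original model
    return None, baseline_accuracy


# Placeholder accuracies standing in for real evaluation runs
measured = {
    "int8_all_ops": 0.741,
    "int8_skip_first_layer": 0.758,
    "int8_skip_first_and_last_layer": 0.760,
}
candidates = list(measured)

chosen, accuracy = tune(candidates, measured.get, baseline_accuracy=0.762)
print(chosen, accuracy)
```

With a 1% relative tolerance, the fully quantized candidate is rejected and the search settles on the next, slightly less aggressive configuration. A real tuner explores a much larger space, such as per-operator precision and calibration settings, but the accept-or-fall-back logic follows the same pattern.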
Figure 2. Intel® Neural Compressor Infrastructure
Transformers are deep learning models that are increasingly used for Natural Language Processing (NLP). Alibaba’s end-to-end Machine Learning Platform for AI (PAI) uses Intel-optimized PyTorch transformers for real-world processing tasks for their millions of users.
Low latency and high throughput are key to a Transformer model’s success, and 8-bit low precision is a promising technique to meet such requirements. Intel® DL Boost offers powerful capabilities for 8-bit low precision inference on AI workloads. With the support of Intel® Neural Compressor (previously called the Intel® Low Precision Optimization Tool), we can optimize 8-bit inference performance while significantly reducing accuracy loss. You can read more about the partnership with Alibaba and how Intel’s latest CPUs and the Intel Neural Compressor tool helped bring up to a 3x performance boost on the Alibaba PAI blade inference toolkit here.
Intel® Neural Compressor is also an integral part of the Optimum ML Optimization Toolkit from Hugging Face, which aims to enable maximum efficiency and production performance for running Transformer models. The Intel® Neural Compressor makes models faster with minimal impact on accuracy, leveraging post-training quantization, quantization-aware training, and dynamic quantization. It also helps make them smaller with minimal impact on accuracy, with easy-to-use configurations to remove model weights. Read more about how one can quantize the BERT model for Intel® Xeon® CPUs here.
Intel® Neural Compressor is available as part of the Intel® oneAPI AI Analytics Toolkit, which provides high-performance APIs and Python packages to accelerate end-to-end machine-learning and data-science pipelines, or as a stand-alone component.
Intel® Distribution of OpenVINO™ Toolkit
The Intel® Distribution of OpenVINO™ toolkit enables practitioners to optimize, tune, and run comprehensive AI inference using an included model optimizer and runtime and development tools. It supports many of the popular AI frameworks, including TensorFlow, ONNX, PyTorch, and Keras, and allows for deployment of applications across combinations of accelerators and environments, including CPUs, GPUs, and VPUs, and from the edge to the cloud.
Figure 3. Intel® Distribution of OpenVINO™ Toolkit
Developers can explore over 350 pre-trained models that are optimized and hosted in the Open Model Zoo repository, including popular object detection models such as YOLO and Mobilenet-SSD, which are optimized with the post-training optimization tool and benchmarked for performance. Also included are several state-of-the-art models for pose estimation, action recognition, text spotting, pedestrian tracking, and scene and object segmentation that can be easily downloaded for immediate use.
To try it, developers can use OpenVINO Notebooks, which install OpenVINO locally for rapid prototyping and validation of their workloads. You can get started with just your laptop and get a real-time performance boost from our optimized models in less than 15 minutes!
AI Inference Application demo — Neural Style Transfer
Hopefully our discussion today has helped you get a better sense of the Inference stage of the AI workflow, its importance and applications, and how it can be accelerated through both AI-optimized hardware architectures and software tools. Something that has always helped me crystallize concepts is using them in hands-on applications. As mentioned earlier, I love AI and I love to paint. I want to leave you with a quick demo on Neural Style Transfer where I use Intel® CPUs and Intel-optimized TensorFlow to transform my paintings into different styles ranging from Van Gogh’s Starry Night to a design of an Intel Chip and many more!
Figure 4. Neural Style Transfer
Neural Style Transfer is an AI optimization technique that combines your original image with the artistic style of a reference image. Here is a link to all the files, including code and images, that you will need to run your own Neural Style Transfer experiment along with a short video that walks you through all of the steps.
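For readers curious about what is being optimized, here is a minimal sketch of the loss used in one widely known formulation of style transfer, where style is compared through Gram matrices of feature maps. Whether the linked demo uses exactly this formulation is an assumption; the function names, the loss weights, and the tiny feature lists are illustrative, and real code would use features extracted from a convolutional network in a framework such as TensorFlow.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of a feature map.

    features: list of channels, each a flat list of activations.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def style_loss(style_features, generated_features):
    """Mean squared difference between the two Gram matrices."""
    gs = gram_matrix(style_features)
    gg = gram_matrix(generated_features)
    c = len(gs)
    return sum((gs[i][j] - gg[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)


def content_loss(content_features, generated_features):
    """Mean squared difference between raw activations."""
    diffs = [(a - b) ** 2
             for fc, fg in zip(content_features, generated_features)
             for a, b in zip(fc, fg)]
    return sum(diffs) / len(diffs)


def total_loss(content, style, generated, content_weight=1.0, style_weight=100.0):
    """Weighted sum that the generated image is optimized to minimize."""
    return (content_weight * content_loss(content, generated)
            + style_weight * style_loss(style, generated))


# Two channels with two spatial positions each (toy values)
features = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[2.0, 1.0], [4.0, 3.0]]  # same activations, positions swapped

same_style = style_loss(features, shuffled)
different_content = content_loss(features, shuffled)
```

Swapping spatial positions leaves the Gram matrix unchanged, so the style loss is zero while the content loss is not. That is why this kind of loss captures textures and colors of the reference image without copying its layout.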