In this video, Intel cloud software engineer Ben Olson demonstrates how to accelerate deep learning inference by applying default optimizations in TensorFlow* for Intel hardware and quantizing to int8.
When deploying deep learning models, inference speed is usually measured in terms of latency or throughput, depending on your application’s requirements. Latency is how quickly you can get an answer, whereas throughput is how much data the model can process in a given amount of time. Both use cases benefit from accelerating the inference operations of the deep learning framework running on the target hardware.
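To illustrate the two metrics, the following minimal Python sketch times a Keras ResNet-50 model at two batch sizes. The model, input shape, batch sizes, and iteration count are illustrative assumptions, not the demo's benchmark code.

import time
import numpy as np
import tensorflow as tf

# Untrained ResNet-50 stands in for the demo's model; random inputs stand in for the synthetic dataset.
model = tf.keras.applications.ResNet50(weights=None)

def measure(batch_size, iterations=20):
    data = np.random.rand(batch_size, 224, 224, 3).astype(np.float32)
    model.predict(data, verbose=0)                        # warm-up run
    start = time.perf_counter()
    for _ in range(iterations):
        model.predict(data, verbose=0)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iterations * 1000              # time to get one batch of answers
    images_per_sec = batch_size * iterations / elapsed    # throughput
    return latency_ms, images_per_sec

print("batch size 1: ", measure(1))     # latency-oriented setting
print("batch size 64:", measure(64))    # throughput-oriented setting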
Engineers from Intel and Google* have collaborated to optimize TensorFlow* running on Intel® hardware. This work is part of the Intel® oneAPI Deep Neural Network Library (oneDNN) and is available to use as part of standard TensorFlow. The demo shows that in addition to importing standard TensorFlow, you only need to set one environment variable to turn on these optimizations:
export TF_ENABLE_ONEDNN_OPTS=1
Note: In TensorFlow 2.9 and later, these optimizations are enabled by default, so this environment variable is no longer needed.
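For example, in Python the variable can be set before TensorFlow is imported (a minimal sketch; on TensorFlow 2.9 and later it has no effect because the optimizations are already on):

import os
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"   # must be set before TensorFlow is imported

import tensorflow as tf
print(tf.__version__)
# When the oneDNN optimizations are active, TensorFlow prints a oneDNN notice at startup.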
The oneDNN optimizations also take advantage of instruction set features such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and Intel® Advanced Matrix Extensions (Intel® AMX).
The demo shows deep learning inference in TensorFlow using a ResNet*-50 model and a synthetic dataset, running on an Amazon Web Services (AWS)* m6i.32xlarge instance with an Intel® Xeon® Platinum 8375C processor at 2.90 GHz and 512 GiB of memory. This processor supports the Vector Neural Network Instructions (VNNI) for Intel AVX-512.
For optimizing latency, you typically use a batch size of one image per instance of the model and parallelize each instance across physical cores. Optimizing throughput uses much larger batch sizes, so you need only load the weights once and can use all the available physical cores to parallelize as much as possible. Both use cases benefit from the oneDNN optimizations. The demo runs the model using the latency settings (a batch size of one and one instance across four physical cores), both with and without the VNNI instructions for Intel AVX-512. For this configuration, the VNNI instructions for Intel AVX-512 sped up both throughput and latency by about 35%.
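The demo drives these settings with its own launch scripts. As a rough sketch only, the batch size and thread counts for the latency case can be expressed through TensorFlow's public threading API; the model and input shape below are illustrative:

import numpy as np
import tensorflow as tf

# Latency-oriented settings: one model instance parallelized across four physical cores.
tf.config.threading.set_intra_op_parallelism_threads(4)    # cores used within one instance
tf.config.threading.set_inter_op_parallelism_threads(1)

model = tf.keras.applications.ResNet50(weights=None)       # stand-in for the demo's ResNet-50
image = np.random.rand(1, 224, 224, 3).astype(np.float32)  # batch size of one
model.predict(image, verbose=0)

In practice, pinning an instance to specific cores is usually handled by a launcher such as numactl; the threading API above only controls TensorFlow's internal thread pools.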
To further improve performance, you can quantize the model from 32-bit single-precision floating point (FP32) to 8-bit integer (int8). This not only reduces the size of the model and its weights, but also lets oneDNN further parallelize computations using the VNNI instructions for Intel AVX-512. Because Intel AVX-512 is a Single Instruction Multiple Data (SIMD) instruction set with 512-bit registers, quantization increases the number of values each register processes from 16 FP32 values to 64 int8 values. The VNNI instructions for Intel AVX-512 accelerate inference further by taking advantage of int8 quantization to combine three instructions into one.
The demo uses a script and a configuration file to perform quantization with Intel® Neural Compressor, but this step is not shown due to time constraints. You can try this step, along with other model compression techniques, through the GitHub* repository for the Intel Neural Compressor.
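As a rough sketch of what that quantization step looks like, the snippet below uses the Python API of Intel Neural Compressor (2.x releases) with a dummy calibration dataset. The file paths are placeholders, the demo itself uses a script and a configuration file, and the exact API varies between Neural Compressor releases:

from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit

# Dummy calibration data in place of a real calibration set.
dataset = Datasets("tensorflow")["dummy"](shape=(100, 224, 224, 3))
calib_dataloader = DataLoader(framework="tensorflow", dataset=dataset)

# Quantize an FP32 frozen graph (placeholder path) to int8 with default post-training settings.
q_model = fit(
    model="./resnet50_fp32.pb",
    conf=PostTrainingQuantConfig(),
    calib_dataloader=calib_dataloader,
)
q_model.save("./resnet50_int8")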
Quantizing to int8 delivers an additional 2x speedup, resulting in an overall speedup of 2.6x compared to the baseline of not using the VNNI instructions for Intel AVX-512, as shown in the following image:

While this demo focuses specifically on speeding up TensorFlow-based deep learning inference by taking advantage of the VNNI instruction set for Intel AVX-512, you can speed up your entire end-to-end AI workflow running on Intel hardware with Intel® AI and machine learning development tools.