Enhanced low-precision pipeline to accelerate inference with OpenVINO toolkit

The new release of Intel Distribution of OpenVINO Toolkit features improvements that further optimize deep learning performance on Intel-based platforms.


Overview

Neural network quantization and execution in low precision have been widely adopted as an optimization method that can achieve significant acceleration while maintaining accuracy. Lowering computational precision to 8 bits can be achieved without model re-training in the majority of cases, and only requires a small fine-tuning step in the remaining ones. Meanwhile, quantization results in substantial speedup and higher system throughput, which is beneficial to deployment use cases.

The Intel® Distribution of OpenVINO™ toolkit is a software tool suite that accelerates applications and algorithms with high-performance, deep learning inference deployed from the edge to the cloud. The previous version of the toolkit, released in October 2019, introduced a new technique for post-training model quantization in order to convert models into low precision without re-training while also improving latency. With the latest release, version 2020.1, we continue to make improvements in both enabling a more streamlined development experience in low-precision quantization and optimizing deep learning performance on Intel® architecture-based platforms. The toolkit speeds up inference on a broad range of Intel hardware: Intel Xeon Scalable and Core CPUs for general-purpose compute, Intel Movidius VPUs for dedicated media and vision applications, and FPGAs for flexible programming logic and scale.

Highlights in these improvements include:

  • Post-training Optimization Tool, a completely re-designed model calibration tool that allows developers to quantize models in Intermediate Representation (IR) form, the intrinsic representation consumed by the toolkit’s Inference Engine, without fine-tuning and without duplicating effort when deploying to the run-time environment.
  • Quantization-Aware Training (QAT), a set of third-party components that can produce quantized models compatible with the Intel Distribution of OpenVINO Toolkit. This release introduces a PyTorch-based solution.
  • Improved INT8 runtime, which correctly interprets quantized models represented in IR, obtained either through post-training optimizations or quantization-aware training, for optimal performance on Intel® architecture.

Neural network quantization

Quantization is the replacement of floating-point arithmetic with integer arithmetic. Since the range of floating-point values is very large compared to the range of actual activations and weights, the typical quantization approach is to represent a range of activation and weight values with a discrete set of points. For int8 quantization, the range is therefore represented by 256 values, the number of distinct values expressible in 8 bits. Several quantization modes are possible, and two of them are considered mainstream (see Fig. 1):

  • Symmetric – where activation and weight values are distributed equally around zero.
  • Asymmetric – a.k.a. affine, where values are distributed equally around a non-zero value called the zero-point.

The main difference between these two modes is that symmetric quantization is more hardware-friendly and produces higher speedup, while the asymmetric one introduces extra computations and requires hardware-specific tweaks and considerations. However, asymmetric mode has the potential to more accurately represent the original range and improve accuracy. In practice, symmetric quantization is considered a baseline for model acceleration on CPU and integrated GPU, while asymmetric can be used in special cases, such as quantizing non-ReLU models (ELU, PReLU, GELU, etc.).

 

Fig.1. Quantization modes.
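To make the two modes concrete, here is a minimal numpy sketch of quantizing and dequantizing a tensor in both modes. This illustrates only the arithmetic, not the toolkit’s implementation:

```python
import numpy as np

def quantize(x, x_min, x_max, symmetric=True, bits=8):
    """Quantize a float tensor to `bits`-bit integers, then dequantize back."""
    levels = 2 ** bits  # 256 discrete values for 8 bits
    if symmetric:
        # Range is centered on zero; the zero-point is fixed at 0.
        scale = max(abs(x_min), abs(x_max)) / (levels // 2 - 1)
        zero_point = 0
        q = np.clip(np.round(x / scale), -(levels // 2), levels // 2 - 1)
    else:
        # Affine mode: [x_min, x_max] maps to [0, 255] around a
        # non-zero zero-point.
        scale = (x_max - x_min) / (levels - 1)
        zero_point = np.round(-x_min / scale)
        q = np.clip(np.round(x / scale) + zero_point, 0, levels - 1)
    return (q - zero_point) * scale  # dequantized approximation of x

x = np.array([-0.1, 0.0, 0.5, 2.0, 5.0])  # e.g. post-activation values
sym = quantize(x, x.min(), x.max(), symmetric=True)
asym = quantize(x, x.min(), x.max(), symmetric=False)

# Asymmetric mode spends all 256 levels on the actual [-0.1, 5.0] range,
# so its reconstruction error here is no worse than symmetric mode's.
assert np.abs(asym - x).max() <= np.abs(sym - x).max()
```

With a strongly non-symmetric range like this one (typical for non-ReLU activations), the asymmetric mode recovers the inputs more precisely, which is exactly the trade-off described above.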

Post-training Optimization Tool

The Post-training Optimization Tool (POT) is a re-designed version of our previous Calibration tool and will be released in the Intel Distribution of OpenVINO toolkit version 2020.1. The main purpose of this tool is to perform model optimizations after training. As we discussed in previous posts, post-training optimization is attractive due to its streamlined development process, which does not require fine-tuning.

The main objectives that we had for the tool redesign were:

  • Support for multiple state-of-the-art quantization methods in comparison to the previously supported single method.
  • Cross-platform flexibility and scalability for Intel architecture, from CPUs, iGPUs, VPUs to FPGAs, in order to achieve optimal performance on each target hardware. This also includes support for symmetric and asymmetric quantization methods depending on target capabilities.
  • Expansion of support to multiple deep learning workloads, including computer vision, audio, speech, natural language processing, and recommendation systems.
  • Streamlined development experience, including broader support for and optimization of models, performance maximization, and improved memory allocation and usage.

The primary goal of this release is INT8 quantization, which is supported by next-generation Intel architecture, including Intel Xeon Scalable processors with Intel Deep Learning Boost. INT8 leverages the compounding performance gains of both hardware and software improvements. All quantization features are available on the command line or through the Intel Distribution of OpenVINO toolkit’s visual interface, called Deep Learning Workbench. The general flow remains the same: the tool accepts the intermediate representation (IR) of the trained model and the dataset as input, and produces a quantized IR that can then be consumed by the Inference Engine in the same way as any other IR. This simplifies the deployment of low-precision applications.
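As a rough illustration of the command-line flow, the tool is driven by a JSON configuration that points at the IR files, the evaluation engine, and the chosen algorithm. The file names below are hypothetical, and the keys are an approximation of the tool’s documented schema rather than an authoritative reference:

```json
{
  "model": {
    "model_name": "my_model",
    "model": "my_model.xml",
    "weights": "my_model.bin"
  },
  "engine": {
    "config": "accuracy_checker_config.yml"
  },
  "compression": {
    "target_device": "CPU",
    "algorithms": [
      {
        "name": "DefaultQuantization",
        "params": {
          "preset": "performance",
          "stat_subset_size": 300
        }
      }
    ]
  }
}
```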

The Post-training Optimization Tool provides multiple quantization and accompanying algorithms which help to restore accuracy after quantizing weights and activations. Potentially, algorithms can form independent optimization pipelines which can be applied to quantize one or multiple models. In the 2020.1 release, we focused on providing two proven combinations of algorithms as defined below:

  • Default Quantization: the default method for 8-bit quantization, producing the highest-performing model with little accuracy degradation.
  • Accuracy Aware Quantization: keeps the accuracy drop after quantization within a predefined range while still delivering performance improvements. Note that it may require more time for quantization than the Default Quantization algorithm.

Below, we give a description of the methods used in these two pipelines.

Default Quantization pipeline

The Default Quantization pipeline is designed to perform fast, accurate 8-bit quantization of neural networks. It is a pipeline of three algorithms applied to the model sequentially:

  1. Activation Channel Alignment: a preliminary step before quantization that aligns the ranges of output activations of convolutional layers in order to reduce the quantization error.
  2. MinMax Quantization: a quantization method that automatically inserts FakeQuantize operations in the model graph based on the specified hardware target and initializes them using statistics collected on the calibration dataset.
  3. Bias Correction: adjusts biases of convolutional and fully-connected layers based on the quantization error of the layer in order to make the overall error unbiased.
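Steps 2 and 3 can be sketched with a small numpy example: min/max statistics collected over calibration data yield quantization scales, and a bias shift compensates the mean error introduced by weight quantization. All names and data below are illustrative toys, not the tool’s actual code (the real tool inserts FakeQuantize operations into the IR graph):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in calibration activations and a toy fully-connected layer.
calib = [rng.normal(size=(8, 16)) for _ in range(10)]
W = rng.normal(size=(16, 4))
b = np.zeros(4)

# MinMax Quantization: per-tensor statistics -> symmetric int8 scales.
lo = min(batch.min() for batch in calib)
hi = max(batch.max() for batch in calib)
act_scale = max(abs(lo), abs(hi)) / 127.0

w_scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / w_scale), -127, 127) * w_scale  # fake-quantized weights

# Bias Correction: shift the bias by the mean output error introduced by
# weight quantization so that the overall error becomes unbiased.
X = np.concatenate(calib)
err = (X @ W_q + b).mean(axis=0) - (X @ W + b).mean(axis=0)
b_corrected = b - err

corrected_mean = (X @ W_q + b_corrected).mean(axis=0)
original_mean = (X @ W + b).mean(axis=0)
assert np.allclose(corrected_mean, original_mean)
```

By construction, subtracting the mean error makes the corrected layer’s average output match the full-precision layer exactly, which is the sense in which the overall error is "unbiased."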

Accuracy Aware Quantization pipeline

The Accuracy Aware pipeline is designed to perform accurate 8-bit quantization while keeping the accuracy drop within a predefined range, such as 1%. This may cost some performance in comparison to the Default Quantization pipeline, because some layers can be reverted back to the original precision. Generally, the pipeline consists of the following steps:

  1. First, the model gets fully quantized using the default quantization pipeline.
  2. Then, the quantized and full-precision models are compared on a subset of the validation set in order to find mismatches in the target accuracy metric. A ranking subset is extracted based on the mismatches.
  3. A layer-wise ranking is performed in order to estimate the contribution of each quantized layer to the accuracy drop.
  4. Based on this ranking, the most “problematic” layer is reverted back to the original precision. This change is followed by the evaluation of the obtained model on the full validation set in order to get a new accuracy drop.
  5. If the accuracy criteria are satisfied for all predefined accuracy metrics, the algorithm finishes. Otherwise, it continues by reverting the next “problematic” layer.
  6. If reverting a layer does not improve accuracy, or even worsens it, re-ranking is triggered as described in step 3.
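The reverting loop of steps 4 and 5 is essentially a greedy procedure, which can be sketched as follows. The per-layer contributions and the accuracy budget below are simulated toy numbers, not output of the actual tool, and in reality each revert is followed by a full validation-set evaluation rather than simple subtraction:

```python
# Simulated contribution of each quantized layer to the accuracy drop,
# as produced by the layer-wise ranking (toy numbers, percentage points).
ranking = {"conv5": 0.9, "conv2": 0.4, "fc1": 0.2, "conv1": 0.05}

max_drop = 1.0                      # predefined accuracy-drop budget, e.g. 1%
total_drop = sum(ranking.values())  # drop of the fully quantized model
reverted = []

# Greedily revert the most "problematic" layers to the original precision
# until the measured drop fits the budget.
for layer, contribution in sorted(ranking.items(), key=lambda kv: -kv[1]):
    if total_drop <= max_drop:
        break
    reverted.append(layer)
    total_drop -= contribution  # in reality: re-evaluate on the validation set

assert reverted == ["conv5"]
assert total_drop <= max_drop
```

In this toy run, reverting only the single worst layer already brings the drop within the 1% budget, so the remaining layers stay in INT8, preserving most of the speedup.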

Comparison with other frameworks will be conducted and performance benchmarks will be posted on docs.openvinotoolkit.org.

Quantization-Aware Training (QAT) compatible with the Intel Distribution of OpenVINO toolkit

To provide training capabilities to the OpenVINO community, we are releasing support for low-precision models in the Neural Network Compression Framework (NNCF), which is part of the OpenVINO Training Extensions. These Training Extensions are intended to streamline the development of deep learning models and accelerate time-to-inference. NNCF is built on top of the PyTorch framework and supports a wide range of DL models for various use cases. It also implements quantization-aware training with different quantization modes and settings.

One of the most important features of NNCF is automatic graph transformation: when the model is wrapped, the additional layers required for quantization-aware fine-tuning are inserted automatically. This simplifies the quantization process because the user is not required to be an expert in the quantization flow. To modify a custom training pipeline so that it produces a compressed network, typically only 10-15 lines need to be added to the user’s PyTorch code. In most cases the model is able to restore the original FP32 accuracy after several epochs of fine-tuning. When fine-tuning finishes, the model can be exported to ONNX format, which can be consumed by the regular OpenVINO flow, i.e. the Model Optimizer and Inference Engine.
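Schematically, enabling quantization-aware training in NNCF amounts to wrapping the model together with a JSON configuration such as the one below. The key names follow NNCF’s documented configuration schema, but the input shape is a hypothetical example and the exact schema should be checked against the framework documentation:

```json
{
  "input_info": {
    "sample_size": [1, 3, 224, 224]
  },
  "compression": {
    "algorithm": "quantization"
  }
}
```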

For more details about NNCF QAT and supported models please refer to the framework documentation on GitHub.

Low-precision runtime

Two different quantization paths pose a challenge for unified model representation and execution. OpenVINO represents models quantized through frameworks and via post-training quantization with the FakeQuantize primitive. It can express different types of operations, such as Quantize, Dequantize, Re-Quantize, or even QuantizeDequantize, due to its ability to map an input range to an arbitrary output range. This means that most quantized models can be expressed using this operation, no matter whether a model was obtained using QAT or post-training methods.

Several graph transformation passes must be performed on a quantized network to convert it into a form suitable for low-precision inference. The FakeQuantize primitive represents two consecutive operations: quantization, which produces integral values in the [0, 255] interval, and dequantization, which returns these values back to the floating-point range. In the first stage, the runtime splits FakeQuantize into these two consecutive operations. The second stage attempts to optimize dequantization by propagating it down through the execution graph and fusing it with other layers using equivalent mathematical transformations. In the last pass, pattern-specific optimizations are applied. The transformation passes component is common to all target devices, but can be configured to take into account the features of a particular device.
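The split of FakeQuantize can be illustrated numerically: a single FakeQuantize maps an input range to an output range through rounding, and is exactly equivalent to an integer quantization step producing values in [0, 255] followed by a floating-point dequantization. The ranges below are arbitrary example values, not anything specific to a real model:

```python
import numpy as np

levels = 256
in_low, in_high = -1.0, 1.0     # example input range
out_low, out_high = -1.0, 1.0   # example output range

def fake_quantize(x):
    """Single FakeQuantize op: map [in_low, in_high] onto `levels`
    discrete points in [out_low, out_high]."""
    x = np.clip(x, in_low, in_high)
    q = np.round((x - in_low) / (in_high - in_low) * (levels - 1))
    return q / (levels - 1) * (out_high - out_low) + out_low

def quantize(x):
    """First stage after the split: integral values in [0, 255]."""
    x = np.clip(x, in_low, in_high)
    return np.round((x - in_low) / (in_high - in_low) * (levels - 1))

def dequantize(q):
    """Second stage: return the integers to the floating-point range."""
    return q / (levels - 1) * (out_high - out_low) + out_low

x = np.linspace(-1.5, 1.5, 7)
# Splitting FakeQuantize into quantize + dequantize changes nothing.
assert np.allclose(fake_quantize(x), dequantize(quantize(x)))
```

Because the split is exact, the runtime is then free to push the dequantization stage down the graph and fuse it into subsequent layers without affecting the result.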

Conclusion

The new Post-training Optimization Tool in the Intel Distribution of OpenVINO toolkit 2020.1 enables significant acceleration with little or no degradation in accuracy through model quantization. This enhanced pipeline reduces model size while streamlining the development process, with no model re-training or fine-tuning required. Accelerate deep learning inference on Intel architecture platforms today by using the Intel Distribution of OpenVINO toolkit.

Legal Information

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No product can be absolutely secure.

OPTIMIZATION NOTICE: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #2010804

Intel, the Intel logo, OpenVINO, and other Intel marks are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. ©Intel Corporation 2020
