One-Click PyTorch* Model Compression with Intel® Neural Compressor

Automatic Quantization of PyTorch* Models

Authors: Kai Yao, Haihao Shen, and Huma Abidi, Intel Corporation

Intel® Neural Compressor is an open source Python* library for model compression that reduces the model size and increases deep learning inference performance on CPUs or GPUs. It supports post-training static and dynamic quantization of PyTorch* models.

For PyTorch, correctly applying deep learning optimizations such as int8 quantization is not always trivial. Users must not only insert the corresponding API calls in the right places, but also identify the correct variable names for the calibration dataloader and for the model to be quantized. They may also need to construct an evaluation function for tuning.

To address this issue, Intel Neural Compressor v1.13 provides an experimental auto-quantization feature that allows users to enable quantization without coding. The feature leverages deep learning optimization rules and static program analysis (Python code syntax analysis, static type inference, call graph parsing) that can automatically insert the necessary API code into user scripts.

Auto-Quant Feature: Enable Models in One Click

Overview

The Auto-Quant feature is a code-free solution that automatically enables quantization algorithms in a PyTorch model script and evaluates the result for the best model performance. Supported features include post-training static quantization, post-training dynamic quantization, and mixed precision. (Refer to Neural Coder on GitHub* for details.)

The following example code shows how to enable quantization algorithms and performance evaluation on a pretrained ResNet-50 model for ImageNet*:

from neural_coder import auto_quant

auto_quant(
    code="https://github.com/pytorch/examples/blob/main/imagenet/main.py",  # script to enable and optimize
    args="-a resnet50 --pretrained -e /path/to/imagenet/",  # arguments passed to the script
)

Python* API for PyTorch Programmers

Auto-Quant can be used as a stand-alone Python library. It offers one-click acceleration of deep learning scripts through automatic platform conversions and optimization code insertions, and then benchmarks each applicable optimization it enables to determine the best-performing one.

This feature leverages static program analysis and heuristic deep learning optimization rules to simplify the use of deep learning optimization APIs, which improves developer productivity and facilitates deep learning acceleration. (Learn more in the Neural Coder documentation on GitHub*.)
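As a rough illustration, the same entry point can also be pointed at a local script. The file name and arguments below are hypothetical placeholders, and the call assumes the code parameter accepts local file paths as well as URLs:

from neural_coder import auto_quant

# Auto-Quant on a local inference script (hypothetical file name and arguments).
# The library analyzes the script, inserts the quantization API calls, runs the
# candidate optimizations, and reports the best-performing one.
auto_quant(
    code="run_inference.py",
    args="--model-path ./resnet50.pt --batch-size 32",
)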

Intel® Neural Compressor Bench GUI

Auto-Quant is also integrated into the Intel Neural Compressor Bench GUI for easy access. First, create a new project, and then upload a PyTorch script (Figure 1).

Figure 1. Open the GUI, and then upload a PyTorch script.

Next, choose an optimization approach (Figure 2).

Figure 2. Choose an optimization approach for the PyTorch script.

Post-training dynamic and static quantization (with the FX backend) are supported in the current Intel Neural Compressor release, with new features under development (Figures 3 and 4). Currently, the evaluation function is constructed as a dummy version, so while the performance boost from the optimization can be demonstrated, tuning is bypassed at this stage. However, we will soon support the construction of “real” evaluation functions for the most popular model zoos (a sketch of such a function follows Figure 4).

Figure 3. Original PyTorch code for model inference.

Figure 4. Static quantization (FX) optimization result as a patch.
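For reference, a “real” evaluation function for an image-classification model is typically a top-1 accuracy loop over a validation dataloader. The sketch below is illustrative only; model and val_loader are assumed to exist in the user script and are not part of the Auto-Quant API:

import torch

def evaluate(model, val_loader):
    """Return top-1 accuracy on a validation dataloader (illustrative sketch)."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total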

Static quantization quantizes both model weights and activations, and it allows activations to be fused into preceding layers where possible. Unlike dynamic quantization, where scales and zero points are collected during inference, static quantization determines scales and zero points before inference using a calibration dataset. As a result, static quantization is typically faster at inference time, making statically quantized models more suitable for inference than dynamically quantized ones.
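For comparison, the snippet below shows what post-training static quantization with the FX backend looks like in stock PyTorch (a recent release). It is a minimal sketch that calibrates on random data and is not the exact patch Auto-Quant generates:

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model_fp32 = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.ReLU(),
).eval()

example_inputs = (torch.randn(1, 3, 224, 224),)
qconfig_mapping = get_default_qconfig_mapping("fbgemm")

# Insert observers, run calibration data through the model, then convert to int8.
prepared = prepare_fx(model_fp32, qconfig_mapping, example_inputs)
with torch.no_grad():
    for _ in range(10):  # calibration loop (random data stands in for a real set)
        prepared(torch.randn(1, 3, 224, 224))
model_int8 = convert_fx(prepared)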

Dynamic quantization stores the weights of the neural network as integers, while the activations are quantized dynamically during inference (Figure 5). Compared to a floating-point neural network, a dynamically quantized model is much smaller because the weights are stored as low-bit-width integers. Unlike other quantization techniques, dynamic quantization requires no data for calibration or fine-tuning.

Figure 5. A dynamic quantization optimization result as a patch.
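Again for comparison, stock PyTorch exposes post-training dynamic quantization as a one-line call. The sketch below quantizes only the Linear layers of a toy model and needs no calibration data:

import torch

model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Weights of Linear layers are stored as int8; activations are quantized
# on the fly during inference, so no calibration dataset is required.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)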

Users can benchmark the optimizations against the original input model to compare performance before and after specific optimizations (Figure 6). The GUI design is the same as for running benchmarks on TensorFlow*/ONNX* (Open Neural Network Exchange) models, so users who have previously used this functionality for TensorFlow/ONNX models will find the experience familiar.

Figure 6. Perform benchmarking on the model.

Future Work

We welcome any feedback on this feature. Other features are planned for integration in the next Intel Neural Compressor release, for example, stock PyTorch int8 enabling similar to Intel Neural Compressor int8, support for a broader range of models, and more. We also encourage you to check out Intel’s other AI tools and framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI software portfolio.

Download as Part of the Toolkit

Intel® Neural Compressor is available in the Intel® AI Analytics Toolkit (AI Kit), which provides accelerated machine learning and data analytics pipelines with optimized deep learning frameworks and high-performing Python libraries.

Get It Now