Intel® Neural Compressor
Speed Up AI Inference without Sacrificing Accuracy
Deploy More Efficient Deep Learning Models
Intel® Neural Compressor performs model compression to reduce the model size and increase the speed of deep learning inference for deployment on CPUs or GPUs. This open source Python* library automates popular model compression technologies, such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks.
Using this library, you can:
- Converge quickly on quantized models through automatic, accuracy-driven tuning strategies (a minimal example is sketched after this list).
- Prune the least important parameters for large models.
- Distill knowledge from a larger model to improve the accuracy of a smaller model for deployment.
- Get started with model compression with one-click analysis and code insertion.
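For example, an accuracy-driven post-training quantization run typically takes only a few lines. The sketch below assumes the library's 2.x Python API (`PostTrainingQuantConfig` and `neural_compressor.quantization.fit`); the FP32 model, calibration dataloader, and evaluation function are user-supplied placeholders, and exact names can differ between releases.

```python
# Minimal sketch: accuracy-driven post-training quantization (2.x API).
# fp32_model, calib_loader, validation_loader, and evaluate() are
# user-supplied placeholders, not part of the library.
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit

def eval_func(model):
    # Return a single accuracy value; the tuner compares it against the
    # FP32 baseline to accept or reject each candidate configuration.
    return evaluate(model, validation_loader)

q_model = fit(
    model=fp32_model,                # e.g., a torch.nn.Module or TensorFlow graph
    conf=PostTrainingQuantConfig(),  # default accuracy-driven tuning strategy
    calib_dataloader=calib_loader,   # small representative dataset for calibration
    eval_func=eval_func,
)
q_model.save("./quantized_model")
```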
Intel Neural Compressor is part of the end-to-end suite of Intel® AI and machine learning development tools and resources.
Download as Part of the Toolkit
Intel Neural Compressor is available in the Intel® AI Analytics Toolkit (AI Kit), which provides accelerated machine learning and data analytics pipelines with optimized deep learning frameworks and high-performing Python libraries.
Download the Stand-Alone Version
A stand-alone download of Intel Neural Compressor is available. You can download binaries from Intel or choose your preferred repository.
Develop in the Cloud
Build and optimize oneAPI multiarchitecture applications using the latest optimized Intel® oneAPI and AI tools, and test your workloads across Intel® CPUs and GPUs. No hardware installations, software downloads, or configuration necessary. Free for 120 days with extensions possible.
Help Intel Neural Compressor Evolve
This open source component has an active developer community. We welcome you to participate.
Features
Model Compression Techniques
- Quantize activations and weights to int8, bfloat16, or a mixture of FP32, bfloat16, and int8 to reduce model size and to speed inference while minimizing precision loss. Quantize during training, post-training, or dynamically, based on the runtime data range.
- Prune parameters that have minimal effect on accuracy to reduce the size of a model. Configure pruning patterns, criteria, and schedule (see the pruning sketch after this list).
- Automatically tune quantization and pruning to meet accuracy goals.
- Distill knowledge from a larger model (“teacher”) to a smaller model (“student”) to improve the accuracy of the compressed model.
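The pruning configuration referenced above can be sketched with the training-time compression API (`WeightPruningConfig` plus `prepare_compression`). The sparsity target, pattern, and step schedule below are illustrative, the training loop is only outlined, and callback names may vary slightly by release.

```python
# Hedged sketch: structured weight pruning toward a target sparsity.
# model, train_loader, num_epochs, and the loss/optimizer setup are user-supplied.
from neural_compressor.training import prepare_compression, WeightPruningConfig

config = WeightPruningConfig(
    target_sparsity=0.9,  # prune 90% of the targeted weights by end_step
    pattern="4x1",        # structured 4x1 pruning pattern
    start_step=1000,
    end_step=10000,
)
compression_manager = prepare_compression(model, config)
compression_manager.callbacks.on_train_begin()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_loader):
        compression_manager.callbacks.on_step_begin(step)
        # ... forward pass, loss, backward pass, optimizer step ...
        compression_manager.callbacks.on_step_end()
compression_manager.callbacks.on_train_end()
```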
Automation
- Quantize with one click using the Neural Coder plug-in for JupyterLab and Microsoft Visual Studio Code*, which automatically benchmarks approaches to optimize performance.
- Achieve performance objectives within expected accuracy criteria using built-in tuning strategies that automatically apply quantization techniques to operations (see the tuning sketch after this list).
- Combine multiple model compression techniques with one-shot optimization orchestration.
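A hedged sketch of how an accuracy goal and a tuning budget can be expressed, assuming the `AccuracyCriterion` and `TuningCriterion` objects from the 2.x configuration API (the threshold and trial count are illustrative):

```python
# Hedged sketch: constrain the automatic tuning loop with an accuracy goal
# and a trial budget (illustrative values).
from neural_compressor.config import (
    AccuracyCriterion,
    PostTrainingQuantConfig,
    TuningCriterion,
)

conf = PostTrainingQuantConfig(
    # Accept a quantized model only if the relative accuracy drop is within 1%.
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),
    # Stop after 100 candidate configurations; timeout=0 disables the time limit.
    tuning_criterion=TuningCriterion(timeout=0, max_trials=100),
)
# Pass `conf` to neural_compressor.quantization.fit() as in the earlier sketch.
```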
Interoperability
- Compress models created with PyTorch*, TensorFlow*, or Open Neural Network Exchange (ONNX*) Runtime.
- Configure model objectives and evaluation metrics without writing framework-specific code.
- Export compressed models in PyTorch, TensorFlow, or ONNX format for interoperability with other frameworks (see the export sketch after this list).
- Validate quantized ONNX models for deployment to third-party hardware architectures via ONNX Runtime.
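Exporting a quantized PyTorch model to ONNX might look like the sketch below, which assumes the `Torch2ONNXConfig` export helper described for recent releases; the input shape, tensor names, and opset are illustrative, and the helper's exact signature may differ by version.

```python
# Hedged sketch: export an int8 PyTorch model produced by quantization.fit()
# to ONNX in QDQ (quantize/dequantize) format. Shapes and names are illustrative.
import torch
from neural_compressor.config import Torch2ONNXConfig

int8_onnx_config = Torch2ONNXConfig(
    dtype="int8",
    opset_version=14,
    quant_format="QDQ",
    example_inputs=torch.randn(1, 3, 224, 224),
    input_names=["input"],
    output_names=["output"],
)
q_model.export("int8-model.onnx", int8_onnx_config)  # q_model from the earlier sketch
```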
Case Studies
Accelerating Alibaba* Transformer Model Performance
Alibaba Group* and Intel collaborated to explore and deploy their AI int8 models on platforms that are based on 3rd generation Intel® Xeon® Scalable processors.
CERN Uses Intel® Deep Learning Boost and oneAPI to Juice Inference without Accuracy Loss
Researchers at CERN demonstrated a nearly twofold speedup in inference by using reduced precision without compromising accuracy.
A 3D Digital Face Reconstruction Solution Enabled by 3rd Generation Intel® Xeon® Scalable Processors
By quantizing the Position Map Regression Network from an FP32-based inference down to int8, Tencent Games* improved inference efficiency and provided a practical solution for 3D digital face reconstruction.
Demonstrations
Free Your Code with Neural Coder
Neural Coder is a one-click, no-code solution to optimize deep learning models via model compression. This short demo from the keynote presentation at Intel® Innovation 2022 introduces the solution.
One-Click Acceleration of Hugging Face* Transformers with Neural Coder
Optimum for Intel is an extension to Hugging Face* Transformers that provides optimization tools for training and inference. Neural Coder automates int8 quantization using the API for this extension.
Distill and Quantize BERT Text Classification
Perform knowledge distillation of the BERT base model and quantize to int8 using the Stanford Sentiment Treebank 2 (SST-2) dataset. The resulting BERT-Mini model performs inference up to 16x faster.
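A hedged sketch of the distillation setup behind this kind of workflow, using the training-time compression API with `DistillationConfig` and a knowledge distillation loss (names follow the 2.x documentation; the temperature and loss weights are illustrative):

```python
# Hedged sketch: distill a teacher model into a smaller student.
# teacher_model, student_model, train_loader, criterion, and optimizer
# are user-supplied placeholders.
from neural_compressor.config import DistillationConfig, KnowledgeDistillationLossConfig
from neural_compressor.training import prepare_compression

distil_loss = KnowledgeDistillationLossConfig(
    temperature=1.0,
    loss_types=["CE", "KL"],  # cross-entropy on labels + KL against teacher logits
    loss_weights=[0.5, 0.5],
)
conf = DistillationConfig(teacher_model=teacher_model, criterion=distil_loss)
compression_manager = prepare_compression(student_model, conf)
model = compression_manager.model

compression_manager.callbacks.on_train_begin()
for inputs, labels in train_loader:
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    # Blend the task loss with the distillation loss computed against the teacher.
    loss = compression_manager.callbacks.on_after_compute_loss(inputs, outputs, loss)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
compression_manager.callbacks.on_train_end()
```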
Quantization in PyTorch Using Fine-Grained FX
Convert an imperative model into a graph model, and perform dynamic quantization, quantization-aware training, or post-training static quantization.
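For comparison with the static flow sketched earlier, post-training dynamic quantization skips calibration entirely and computes activation ranges at runtime; a hedged sketch with the same assumed 2.x API:

```python
# Hedged sketch: post-training dynamic quantization of a PyTorch model.
# fp32_model is a user-supplied torch.nn.Module.
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit

conf = PostTrainingQuantConfig(approach="dynamic")  # activation ranges computed at runtime
q_model = fit(model=fp32_model, conf=conf)          # no calibration dataloader required
q_model.save("./dynamic_int8_model")
```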
AI Inference Acceleration on CPUs
Deploying a trained model for inference often requires modification, optimization, and simplification based on where it is being deployed. This overview of Intel’s end-to-end solution includes a downloadable neural style transfer demonstration.
Accelerate AI Inference without Sacrificing Accuracy
This webinar provides an overview of available model compression techniques and demonstrates an end-to-end quantization workflow.
News
Intel, Habana* Labs, and Hugging Face Advance Deep Learning Software
Learn how this collaboration scales adoption of transformer model training and inference optimized for Intel Xeon Scalable and Habana Gaudi* and Gaudi2 processors.
Alibaba Cloud Integrates Intel Neural Compressor into a Machine Learning Platform for AI
Accelerate deep learning models with one click by using Neural Coder with the Alibaba* Blade-DISC optimization. This solution is integrated in the Alibaba Data Science Workshop development environment.
Deep Learning Model Optimizations Made Easy (or at Least Easier)
Get an introduction to popular model compression techniques. Learn how these techniques can be used to improve performance, reduce the cost, and increase the energy efficiency of deep learning inference.
Documentation & Code Samples
Code Samples
- Get Started
- Model Compression: TensorFlow | PyTorch | ONNX Runtime
- Get Started with Neural Coder
- Accelerate Stable Diffusion with Post-Training Quantization
- Quantizing a DistilBERT Humor Natural Language Processing (NLP) Model
- Accelerate VGG19 Model Inference on 4th Gen Intel Xeon Scalable Processors
Specifications
Processor:
- Intel Xeon processor
- Intel Xeon CPU Max Series
- Intel® Core™ processor
- Intel® Data Center GPU Flex Series
- Intel® Data Center GPU Max Series
Operating systems:
- Linux*
- Windows*
Language:
- Python
Get Help
Your success is our success. Access this support resource when you need assistance.
For additional help, see the general oneAPI Support.
Stay Up to Date on AI Workload Optimizations
Sign up to receive hand-curated technical articles, tutorials, developer tools, training opportunities, and more to help you accelerate and optimize your end-to-end AI and data science workflows.
Take a chance and subscribe. You can change your mind at any time.