Speed Up AI Inference without Sacrificing Accuracy
Deploy More Efficient Deep Learning Models
Intel® Neural Compressor performs model compression to reduce model size and increase the speed of deep learning inference for deployment on CPUs or GPUs. This open source Python* library automates popular model compression techniques, such as quantization, pruning, and knowledge distillation, across multiple deep learning frameworks.
Using this library, you can:
Converge quickly on quantized models through automatic, accuracy-driven tuning strategies.
Prune model weights by specifying predefined sparsity goals that drive pruning algorithms.
Distill knowledge from a larger network (“teacher”) to train a smaller network (“student”) to mimic its performance with minimal precision loss.
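The accuracy-driven tuning mentioned above can be pictured as a simple search loop: try candidate compression configurations and accept the first one whose accuracy stays within a tolerance of the FP32 baseline. This is an illustrative sketch, not Intel Neural Compressor's actual API; the function and config names are hypothetical.

```python
# Illustrative sketch of accuracy-driven tuning (not the library's API):
# evaluate candidate configurations in order and return the first one
# whose accuracy drop from the FP32 baseline stays within max_drop.

def accuracy_driven_tuning(evaluate, candidate_configs, baseline_acc, max_drop=0.01):
    """Return (config, accuracy) for the first acceptable candidate, or (None, None)."""
    for config in candidate_configs:
        acc = evaluate(config)
        if baseline_acc - acc <= max_drop:
            return config, acc
    return None, None  # no candidate met the accuracy criterion


# Hypothetical candidates, from most to least aggressive:
configs = ["int8_all", "int8_except_last", "bf16_all"]
simulated_accs = {"int8_all": 0.90, "int8_except_last": 0.945, "bf16_all": 0.949}
best, acc = accuracy_driven_tuning(lambda c: simulated_accs[c], configs, baseline_acc=0.95)
# best == "int8_except_last": its 0.005 accuracy drop is within the 0.01 tolerance
```

In the real library, the candidate space, accuracy criterion, and tuning strategy are specified declaratively rather than coded by hand.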
Intel Neural Compressor is available in the Intel® AI Analytics Toolkit (AI Kit), which provides accelerated machine learning and data analytics pipelines with optimized deep learning frameworks and high-performing Python libraries.
Get what you need to build and optimize your oneAPI projects for free. With an Intel® Developer Cloud account, you get 120 days of access to the latest Intel® hardware—CPUs, GPUs, FPGAs—and Intel® oneAPI tools and frameworks. No software downloads. No configuration steps. No installations.
Quantize data and computation to INT8 or BF16, or to a mixture of FP32, BF16, and INT8, to reduce model size and speed up inference while minimizing precision loss. Quantize during training, post-training, or dynamically based on the runtime data range.
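The core idea behind INT8 quantization is mapping an FP32 tensor onto 256 integer levels with a scale and zero point, then recovering an approximation at compute time. A minimal numpy sketch of asymmetric (affine) quantization, assuming a simple min/max calibration rather than the library's tuned calibration algorithms:

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric affine quantization of an FP32 tensor to int8
    using simple min/max calibration."""
    scale = (x.max() - x.min()) / 255.0  # one step covers 1/255 of the range
    if scale == 0:
        scale = 1.0  # constant tensor: avoid division by zero
    zero_point = np.round(-x.min() / scale) - 128  # map x.min() onto -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an FP32 approximation of the original tensor."""
    return (q.astype(np.float32) - zero_point) * scale
```

The reconstruction error is bounded by roughly half a quantization step, which is why well-calibrated INT8 models lose so little accuracy relative to the 4x reduction in weight storage.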
Prune parameters that have minimal effect on accuracy to reduce the size of a network. Discard weights in structured or unstructured sparsity patterns, or remove filters or layers according to specified rules.
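Unstructured magnitude pruning, one of the simplest pruning algorithms referenced above, zeroes the weights with the smallest absolute values until a target sparsity is reached. A minimal sketch, not the library's implementation:

```python
import numpy as np

def magnitude_prune(weights, target_sparsity):
    """Unstructured magnitude pruning: zero out the fraction of entries
    with the smallest absolute values (ties at the threshold may prune
    slightly more than requested)."""
    k = int(target_sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([0.1, -0.5, 0.05, 2.0])
pruned = magnitude_prune(w, target_sparsity=0.5)
# → [0.0, -0.5, 0.0, 2.0]: the two smallest-magnitude weights are zeroed
```

Structured variants instead remove whole blocks, filters, or channels so the resulting sparsity pattern maps onto hardware-friendly dense operations.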
Distill knowledge from a teacher network to a student network to improve the accuracy of the compressed model.
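Knowledge distillation typically trains the student on a weighted sum of two terms: a KL-divergence loss against the teacher's temperature-softened output distribution, and an ordinary cross-entropy loss against the hard labels. A numpy sketch of that standard loss (the temperature T and weight alpha are illustrative hyperparameters, not the library's defaults):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled, numerically stable softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha-weighted sum of the soft-target KL term (scaled by T^2, as is
    conventional) and the hard-label cross-entropy term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)), axis=-1)
    soft_loss = (T * T) * kl.mean()
    log_p = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The soft targets carry information about how the teacher ranks the wrong classes, which is what lets a small student recover accuracy it could not reach from the hard labels alone.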
Automatically optimize models using recipes that combine model compression techniques to meet performance objectives while satisfying expected accuracy criteria.
APIs for the TensorFlow*, PyTorch*, Apache MXNet*, and Open Neural Network Exchange (ONNX) Runtime Frameworks
Get started quickly with built-in DataLoaders for popular industry dataset objects or register your own dataset.
Preprocess input data using built-in methods such as resize, crop, normalize, transpose, flip, pad, and more.
Configure model objectives and evaluation metrics without writing framework-specific code.
Analyze the graph and tensors after each tuning run with TensorBoard*.
A 3D Digital Face Reconstruction Solution Enabled by 3rd Generation Intel® Xeon® Scalable Processors
By quantizing the Position Map Regression Network from FP32 inference down to INT8, Tencent Games* improved inference efficiency and provided a practical solution for 3D digital face reconstruction.
Deploying a trained model for inference often requires modification, optimization, and simplification based on where it is being deployed. This overview of Intel’s end-to-end solution includes a downloadable neural style transfer demonstration.