Authors: Guokai Ma, Hengyu Meng, Jianping Chen, Zhong Cao, Roger Feng, Daisy Deng, Srujana Gattupalli, Keith Achorn, Haihao Shen, Ed Groden, Subha Balaraman, Ramesh Chukka, Koichi Yamada
Artificial intelligence and deep learning workloads have grown in prominence, creating a need for CPUs, GPUs, and other AI accelerators to accomplish these tasks. Intel provides hardware solutions that range from training on massive datasets to extremely low-power silicon for on-device inference, supporting the business-critical needs of cloud service providers, enterprises, and research teams.
For MLPerf Inference v1.1, we submitted results on 3rd Gen Intel® Xeon® Scalable processors, which are the only x86 data center CPUs with built-in AI acceleration, extensive software optimizations for end-to-end data science, and an ecosystem of smart solutions. Our submission covered all the MLPerf data center benchmarks across data types and frameworks, for AI applications ranging from image processing and natural language processing (NLP) to recommendation systems.
We use Intel® Deep Learning Boost (Intel® DL Boost) technology, including Vector Neural Network Instructions (VNNI), in our INT8-based submissions on all workloads. On RNN-T, software optimizations with mixed-precision INT8 and bfloat16 quantization delivered a 1.86X [5][6] improvement in Offline performance over last round's results on 3rd Gen Intel® Xeon® Scalable processors (codenamed “Cooper Lake”). Across workloads, we have seen up to a 2.2X [5][7] performance improvement over the last round of MLPerf submissions, for example on the Deep Learning Recommendation Model (DLRM). In the open division, we demonstrated INT8 sparse optimization on DLRM, showing 1.6X [7][8] the performance of the dense model. On the popular NLP model BERT-Large, software optimizations such as model weight sharing, a multi-instance SUT in C++, and oneDNN optimizations resulted in a 1.3X Offline and 2.3X Server performance improvement [5][7] compared to our MLPerf Inference v1.0 submissions.
This blog highlights the software engineering behind the scenes for addressing various optimization opportunities. The optimization techniques we describe benefit the model classes in general, beyond the specific models listed. You can find the implementation details in our latest MLPerf inference v1.1 submission.
Software Optimization Highlights in MLPerf v1.1 Submissions
We use Intel® Deep Learning Boost (Intel® DL Boost) technology, including Vector Neural Network Instructions (VNNI), in our INT8-based submissions on ResNet50-v1.5, SSD-ResNet34, 3D UNET, DLRM, and BERT. We use INT8 and bfloat16 (mixed precision) in our Recurrent Neural Network Transducer (RNN-T) submissions. The Intel® Low Precision Optimization Tool (Intel® LPOT) now supports low-precision inference on Intel® processors.
Apart from low-precision inference, the following optimization techniques are used in the submissions. This is not an exhaustive list, since software optimization for inference is a broad area with much exciting research and many innovations in industry and academia.
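To make the role of VNNI in INT8 inference concrete, here is a minimal Python sketch of the arithmetic that one 32-bit lane of the VNNI dot-product instruction performs: four unsigned 8-bit by signed 8-bit products summed into a signed 32-bit accumulator. The function names are hypothetical and the code is a scalar emulation for illustration only, not code from our submission.

```python
def vnni_lane(acc, a_u8, b_s8):
    """Emulate one 32-bit lane of a VNNI dot-product-accumulate step.

    a_u8: four unsigned 8-bit values (0..255)
    b_s8: four signed 8-bit values (-128..127)
    acc:  running signed 32-bit accumulator
    """
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    total = acc + sum(a * b for a, b in zip(a_u8, b_s8))
    # Wrap to signed 32-bit, matching the non-saturating form of the instruction.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


def int8_dot(a_u8, b_s8):
    """Dot product of two equal-length INT8 vectors, four elements per step."""
    assert len(a_u8) == len(b_s8) and len(a_u8) % 4 == 0
    acc = 0
    for k in range(0, len(a_u8), 4):
        acc = vnni_lane(acc, a_u8[k:k + 4], b_s8[k:k + 4])
    return acc


result = int8_dot([1, 2, 3, 4, 5, 6, 7, 8], [1, -1, 1, -1, 2, 2, 2, 2])
```

On hardware, a 512-bit register holds 16 such lanes, so a single instruction performs 64 multiply-accumulates; that is the source of the INT8 throughput advantage over FP32.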
1. Sparsity demonstrated on DLRM in Open Division
Sparsity is a proven technique for reducing the compute and memory footprint of deep learning models. As a rule, accuracy tends to improve as the zero distribution becomes more random (unstructured), while performance tends to improve as it becomes more regular (structured). In our open submission, we continued our work on tile-based structured sparsity, which balances accuracy and performance well. Using a sparsity pattern designed around Intel® DL Boost (AVX512-VNNI), we demonstrated INT8 sparse optimization on DLRM.
Tile-based sparsity consists of blocks of non-zeros arranged in a regular pattern. We trained a sparse DLRM model whose weights follow such a pattern: in the accompanying figure, red boxes mark the tiles of 16 consecutive non-zero elements, and all remaining tiles are zero. For this blog, we call this ‘4x16 tile-based sparsity’; it can fully leverage the AVX512-VNNI capability.
The sparsity ratio of DLRM is 80 percent in geometric mean (geomean), and reaches up to 99 percent for individual General Matrix Multiplications (GEMMs), which correspond to Linear layers in PyTorch. We developed a sparse GEMM kernel and, using the MLPerf test software, achieved a 1.5-1.6X inference speedup on 3rd Gen Intel® Xeon® Scalable processors (codenamed “Ice Lake”) compared to the same software with the original dense implementation, while keeping the accuracy loss within 0.5 percent, as shown in Table 1.
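The idea behind a tile-sparse GEMM can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our kernel: we assume a tile spans 4 rows along the reduction dimension and 16 columns along the output dimension, and the names `to_tiles`, `sparse_matvec`, and `tile_sparsity` are hypothetical.

```python
TILE_K, TILE_N = 4, 16  # assumed tile shape, taken from the "4x16" name


def to_tiles(dense):
    """Keep only the 4x16 tiles of a K x N matrix that contain a non-zero."""
    k_dim, n_dim = len(dense), len(dense[0])
    assert k_dim % TILE_K == 0 and n_dim % TILE_N == 0
    tiles = []
    for k0 in range(0, k_dim, TILE_K):
        for n0 in range(0, n_dim, TILE_N):
            block = [row[n0:n0 + TILE_N] for row in dense[k0:k0 + TILE_K]]
            if any(v != 0 for row in block for v in row):
                tiles.append((k0, n0, block))
    return tiles


def sparse_matvec(x, tiles, n_dim):
    """Compute y = x @ W while touching only the stored tiles."""
    y = [0] * n_dim
    for k0, n0, block in tiles:
        for dk in range(TILE_K):
            xv = x[k0 + dk]
            for dn in range(TILE_N):
                y[n0 + dn] += xv * block[dk][dn]
    return y


def dense_matvec(x, dense):
    """Reference result computed over every element, zeros included."""
    return [sum(x[k] * dense[k][n] for k in range(len(dense)))
            for n in range(len(dense[0]))]


def tile_sparsity(tiles, k_dim, n_dim):
    """Fraction of tiles that are entirely zero and therefore skipped."""
    total = (k_dim // TILE_K) * (n_dim // TILE_N)
    return 1 - len(tiles) / total


# An 8 x 32 weight matrix in which only one of the four tiles is populated.
weights = [[0] * 32 for _ in range(8)]
for dk in range(TILE_K):
    for dn in range(TILE_N):
        weights[4 + dk][16 + dn] = dk + dn - 5
stored = to_tiles(weights)
x = list(range(1, 9))
```

Because whole tiles are skipped, the work scales with the number of stored tiles rather than with the full matrix size, and each stored tile remains a dense block that maps naturally onto vector instructions.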
2. Optimizations performed with DLRM closed division implementation
- Full INT8 model support by using Vector Neural Network Instructions (VNNI)
- Ops inside the interaction layers are fused into a single interaction kernel, and only the lower-triangular results needed for the zflat output are computed, to improve operation efficiency
- Embedding table lookup and the interaction layer are further fused together to reduce memory movement overhead
- The number of instances and the batch size are further tuned, and concatenation overhead is reduced
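The interaction-layer items above can be illustrated with a small Python sketch. It assumes the standard DLRM dot-product interaction with one index per embedding table; the function name and argument layout are hypothetical, and the code shows only the arithmetic, not our fused INT8 kernel.

```python
def fused_lookup_interaction(dense_x, tables, indices):
    """Embedding lookup and pairwise dot-product interaction in one pass.

    dense_x: output of the bottom MLP, a list of d floats
    tables:  one embedding table per sparse feature, each a list of d-wide rows
    indices: one row index per table (single lookup per table is assumed)
    """
    # Read rows straight from the tables instead of first copying them into
    # a separate concatenated buffer.
    feats = [dense_x] + [table[idx] for table, idx in zip(tables, indices)]
    zflat = []
    for i in range(len(feats)):
        for j in range(i):  # strictly lower triangle: Z[i][j] == Z[j][i]
            zflat.append(sum(a * b for a, b in zip(feats[i], feats[j])))
    # The dense features followed by zflat form the input to the top MLP.
    return dense_x + zflat


tables = [
    [[1, 0], [0, 1]],  # table 0 with two rows
    [[2, 2], [3, 3]],  # table 1 with two rows
]
out = fused_lookup_interaction([1, 2], tables, [1, 0])
```

With n feature vectors, the full interaction matrix has n squared entries, but symmetry means only n(n-1)/2 of them carry distinct information, which is why computing the lower triangle alone saves roughly half of the work.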
3. Optimizations performed with BERT closed division implementation
- With pre-compensated quantized tensors, symmetric and asymmetric quantized models achieve equal performance on VNNI.
- Approximate transcendental functions in the model using 2nd-order polynomials.
- Apply Linear-GeLU fusion, and fuse the quantization/compensation procedure into the nearest ops.
- Compile and export the inference model using the TorchScript infrastructure, a production optimization framework in PyTorch.
- Create a multi-threaded, multi-instance SUT in C++ that supports TorchScript models, avoids Python framework overhead and heavy inter-process communication, and automatically shares model weights among instances.
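Two of the items above lend themselves to a short sketch. The first function pair shows one common reading of pre-compensation: with an asymmetric activation zero point z, the term z * sum(w) is computed once per output column offline, so the hot loop needs only unsigned-by-signed 8-bit products. The second function shows a 2nd-order polynomial approximation of exp, as needed for softmax; its coefficients are the quadratic fit published in the I-BERT paper, used here as an assumption rather than the values in our submission. All function names are hypothetical.

```python
import math


def precompute_compensation(w_q, zero_point):
    """Per-output-column term zero_point * sum_k w_q[k][n], computed offline."""
    n_dim = len(w_q[0])
    return [zero_point * sum(row[n] for row in w_q) for n in range(n_dim)]


def linear_u8s8(a_q, w_q, compensation):
    """INT32 result of (a_q - zero_point) @ w_q using only u8 x s8 products."""
    out = []
    for n in range(len(w_q[0])):
        acc = sum(a_q[k] * w_q[k][n] for k in range(len(a_q)))
        out.append(acc - compensation[n])  # remove the zero-point contribution
    return out


def exp_poly2(x):
    """Approximate exp(x) for x <= 0 with a 2nd-order polynomial.

    Range reduction: x = p - q * ln(2) with p in (-ln 2, 0], so that
    exp(x) = exp(p) / 2**q and only exp(p) needs the polynomial.
    """
    assert x <= 0
    ln2 = math.log(2.0)
    q = math.floor(-x / ln2)
    p = x + q * ln2
    poly = 0.3585 * (p + 1.353) ** 2 + 0.344  # assumed coefficients (I-BERT)
    return poly / (2 ** q)


w_q = [[1, -2], [3, 4], [-5, 6]]   # signed 8-bit weights, K=3, N=2
a_q = [130, 120, 128]              # unsigned 8-bit activations
zero_point = 128
comp = precompute_compensation(w_q, zero_point)
y = linear_u8s8(a_q, w_q, comp)
```

The compensated result equals what a symmetric scheme would produce on the zero-centered activations, so the asymmetric model pays no extra cost per inference once the compensation vector has been stored alongside the weights.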
4. Optimizations performed with RNN-T closed division implementation
- Enable INT8 LSTM in RNN-T encoder
- Improve oneDNN reorder efficiency between FP32 and S8
- Improve the RNN-T decoder implementation by using a faster PyTorch op
- Improve load balancing and reduce query pile-up by further tuning dynamic batching in the server scenario
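For readers unfamiliar with dynamic batching in a server scenario, the following Python sketch shows the basic trade-off being tuned. It is a generic illustration, not the scheduler in our submission: the function name and the two parameters, `max_batch` and `max_wait`, are assumptions.

```python
def form_batches(arrivals, max_batch, max_wait):
    """Group time-sorted (arrival_time, query_id) pairs into batches.

    A batch is closed when it already holds max_batch queries, or when the
    next query arrives more than max_wait after the batch's first query.
    """
    batches, current, start = [], [], None
    for t, qid in arrivals:
        if current and (len(current) == max_batch or t - start > max_wait):
            batches.append(current)
            current = []
        if not current:
            start = t  # the first query of a batch sets its deadline
        current.append(qid)
    if current:
        batches.append(current)
    return batches


arrivals = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e")]
batches = form_batches(arrivals, max_batch=2, max_wait=5)
```

A larger `max_batch` improves throughput per instance, while a smaller `max_wait` bounds how long an early query can be held back; tuning the two against the latency target is what keeps queries from piling up.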
All software used is available in the MLPerf results repo, and optimizations are up-streamed to frameworks.
Intel Submission Result for MLPerf v1.1 Inference
Intel submitted data for all data center benchmarks and demonstrated leading CPU performance and software efficiency across the data center benchmark suite. See our task and platform coverage in Table 2, and the complete results of Intel submissions on the MLPerf results page.
With our software optimizations based on Intel® DL Boost (VNNI/INT8) ISA, and more memory capacity and bandwidth on Ice Lake, all the workloads achieved 1.5X or greater per socket performance improvement compared to our MLPerf v1.0 submission on 3rd Gen Intel® Xeon® Scalable processors (codenamed “Cooper Lake”).
With software optimizations, we further improved the performance of BERT on Ice Lake with Intel® DL Boost (VNNI/INT8). Queries per second improved by 1.3X in Offline mode and 2.3X in Server mode [3][9] compared to our MLPerf v1.0 submission on the same platform.
On 3rd Gen Intel® Xeon® Scalable processors (codenamed “Ice Lake”), using Intel® Deep Learning Boost (Intel® DL Boost) (VNNI/INT8) and our software optimizations, the open track DLRM submission with sparsity also saw a performance improvement of 5.8X in Offline mode and 5.2X [10][8] in Server mode compared to the last submission with FP32.
On 3rd Gen Intel® Xeon® Scalable processors (codenamed “Cooper Lake”), RNN-T with software optimizations and the Intel® Deep Learning Boost (Intel® DL Boost) (VNNI/INT8) ISA enabled also achieved a performance improvement of 1.86X in Offline mode and 3X in Server mode [5][6] compared to the v1.0 BF16 submission.
Ease of Use for Intel Submission of MLPerf Inference v1.1 Benchmarks
MLPerf is an industry-standard benchmark for measuring platform AI capabilities in different usage scenarios. To reduce the enabling effort for customers, we also provide containers for four PyTorch-based workloads, so that users can run the MLPerf benchmarks with minimal effort. Refer to the README.md of each workload at https://github.com/mlcommons/inference_results_v1.1/tree/master/closed/Intel/code to pull the MLPerf Intel containers and evaluate AI workload performance on Intel platforms.
Many techniques, including BF16 precision inference and quantization of attention operators in BERT, have been up-streamed into frameworks such as PyTorch and MXNet. This work lets users enjoy the performance boost today without extra code changes. Refer to the code implementation details in the MLCommons™ Inference v1.1 Results GitHub link for all the software optimizations implemented.
We have more exciting AI-focused technologies in the pipeline. Future Intel Xeon Scalable processors, codenamed Sapphire Rapids, will include Intel® Advanced Matrix Extensions (Intel® AMX). We’re also developing a general-purpose GPU optimized for HPC/AI acceleration based on the Intel® Xe architecture, codenamed Ponte Vecchio. We continue to develop the software and hardware to optimize AI performance on Intel® products, empowering enterprises to deliver on the promise of AI.
Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.