A "Double Play" for MLPerf™ Inference Performance Gains with 3rd Generation Intel® Xeon® Scalable Processors

MaryT_Intel · ‎09-21-2021

Key Takeaways

Intel demonstrates, as the only data center CPU vendor to submit MLPerf inference results on a broad set of models, that it is practical to run deep learning inference anywhere on the massive existing base of Intel Xeon servers alongside other applications. Software optimizations are crucial to delivering the full value of Intel’s hardware advances. In this blog, some of Intel’s software engineers describe the methods they used in optimizing the MLPerf inference 1.1 submissions.
Intel’s recent MLPerf inference v1.1 datacenter submissions affirm Intel’s continued momentum in CPU performance for machine learning. The latest 3rd Gen Intel® Xeon® Scalable processors (codenamed “Ice Lake”) deliver up to a 2.3X Server performance improvement on the BERT model compared to the last round of MLPerf submissions [f:MLPerf v0.7 Inference Datacenter Closed ResNet, entry 0.7-101. https://mlcommons.org/en/inference-datacenter-07/] [f:MLPerf v1.1 Inference Datacenter Closed ResNet, entry 1.1-023. https://mlcommons.org/en/inference-datacenter-11/]. Intel’s MLPerf 1.1 results also demonstrate up to a 1.5X increase in Offline and a 1.65X increase in Server performance on ResNet50-v1.5 compared to 2nd Generation Intel® Xeon® Scalable Processors (codenamed “Cascade Lake”) in MLPerf Inference v0.7 [f:MLPerf v0.7 Inference Datacenter Closed ResNet, entry 0.7-101. https://mlcommons.org/en/inference-datacenter-07/] [f:MLPerf v1.1 Inference Datacenter Closed ResNet, entry 1.1-023. https://mlcommons.org/en/inference-datacenter-11/].
The 3rd Gen Intel® Xeon® Scalable processors (codenamed “Cooper Lake”) deliver up to a 3.0X Server performance improvement on the RNN-T model compared to the last round of MLPerf submissions [f:MLPerf v1.0 Inference Datacenter Closed BERT, RNN-T, entry 1.0-20, 1.0-52 https://mlcommons.org/en/inference-datacenter-10/] [f:MLPerf v1.1 Inference Datacenter Closed BERT, RNN-T, entry 1.1-024, 1.1-026 https://mlcommons.org/en/inference-datacenter-11/].

Artificial intelligence and deep learning workloads have grown in prominence, creating a need for CPUs, GPUs, and other AI accelerators to handle them. Intel provides hardware solutions that range from processors for training on massive datasets to extremely low-power silicon for on-device inference, supporting the business-critical needs of cloud service providers, enterprises, and research teams.

For MLPerf Inference v1.1, we submitted results on 3rd Gen Intel® Xeon® Scalable processors, which are the only x86 data center CPUs with built-in AI acceleration, extensive software optimizations for end-to-end data science, and an ecosystem of smart solutions. Our submission covered all the MLPerf data center benchmarks across data types and frameworks, for AI applications ranging from image processing and natural language processing (NLP) to recommendation systems.

We use Intel® Deep Learning Boost (Intel® DL Boost) technology, including Vector Neural Network Instructions (VNNI), in our INT8-based submissions on all workloads. On RNN-T, software optimizations with mixed-precision INT8 and bfloat16 quantization resulted in a 1.86X [f:MLPerf v1.0 Inference Datacenter Closed entry 1.0-20 https://mlcommons.org/en/inference-datacenter-10/] [f:MLPerf v1.1 Inference Datacenter Closed RNN-T, entry 1.1-026 https://mlcommons.org/en/inference-datacenter-11/] Offline performance improvement over last round’s results on 3rd Gen Intel® Xeon® Scalable processors (codenamed “Cooper Lake”). On other models, such as the Deep Learning Recommendation Model (DLRM), we have seen up to a 2.2X [f:MLPerf v1.0 Inference Datacenter Closed entry 1.0-20 https://mlcommons.org/en/inference-datacenter-10/] [f:MLPerf v1.1 Inference Datacenter Closed DLRM, entry 1.1-024 https://mlcommons.org/en/inference-datacenter-11/] performance improvement compared to the last round of MLPerf submissions. In the open division, we demonstrated the success of INT8 sparse optimization on DLRM, showing 1.6X [f:MLPerf v1.1 Inference Datacenter Closed DLRM, entry 1.1-024 https://mlcommons.org/en/inference-datacenter-11/] [f:MLPerf v1.1 Inference Datacenter Open DLRM, entry 1.1-130 https://mlcommons.org/en/inference-datacenter-11/] the performance of the dense model. On the popular NLP model BERT-Large, software optimizations such as model weight sharing, a multi-instance SUT in C++, and oneDNN optimizations resulted in a 1.3X Offline and 2.3X Server performance improvement [f:MLPerf v1.0 Inference Datacenter Closed entry 1.0-20 https://mlcommons.org/en/inference-datacenter-10/] [f:MLPerf v1.1 Inference Datacenter Closed DLRM, entry 1.1-024 https://mlcommons.org/en/inference-datacenter-11/] compared to our v1.0 MLPerf inference submissions.

This blog highlights the software engineering behind the scenes for addressing various optimization opportunities. The optimization techniques we describe benefit the model classes in general, beyond the specific models listed. You can find the implementation details in our latest MLPerf inference v1.1 submission.

Software Optimization Highlights in MLPerf v1.1 Submissions

We use Intel® Deep Learning Boost (Intel® DL Boost) technology, including Vector Neural Network Instructions (VNNI), in our INT8-based submissions on ResNet50-v1.5, SSD-ResNet34, 3D UNET, DLRM, and BERT. We use INT8 and bfloat16 (Mixed precision) in our Recurrent Neural Network Transducer (RNN-T) submissions. The Intel® Low Precision Optimization Tool (Intel® LPOT) now supports low-precision inference on Intel® processors.
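
To make the role of VNNI concrete, here is a minimal Python sketch that emulates the arithmetic of one 32-bit lane of the AVX512-VNNI `VPDPBUSD` instruction: four unsigned 8-bit values are multiplied by four signed 8-bit values, and the sum of the products is added to a 32-bit accumulator. The function names are hypothetical and chosen for illustration only; the sketch ignores 32-bit wraparound and is not taken from the submission code.

```python
def vpdpbusd_lane(acc, a_u8, b_s8):
    """Emulate one 32-bit lane: acc + sum of four (unsigned 8-bit * signed 8-bit) products."""
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8), "first operand must be unsigned 8-bit"
    assert all(-128 <= b <= 127 for b in b_s8), "second operand must be signed 8-bit"
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))


def int8_dot(a_u8, b_s8):
    """Dot product of two equal-length vectors, consumed four elements per step."""
    assert len(a_u8) == len(b_s8) and len(a_u8) % 4 == 0
    acc = 0
    for i in range(0, len(a_u8), 4):
        acc = vpdpbusd_lane(acc, a_u8[i:i + 4], b_s8[i:i + 4])
    return acc


activations = [1, 2, 3, 4, 5, 6, 7, 8]    # unsigned 8-bit values
weights = [1, -1, 1, -1, 2, 2, 2, 2]      # signed 8-bit values
result = int8_dot(activations, weights)
print(result)
```

A 512-bit register holds 16 such lanes, so a single instruction processes 64 byte pairs. Because one operand is unsigned and the other signed, activations are typically stored as unsigned 8-bit values and weights as signed 8-bit values.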

Apart from low-precision inference, the following optimization techniques are used in the submissions. This is not an exhaustive list, since software optimization for inference is a broad area with much exciting research and innovation in industry and academia.

1. Sparsity demonstrated on DLRM in Open Division
Sparsity is a proven technique for reducing the compute and memory footprint of deep learning models. In general, the more irregular (unstructured) the distribution of zeros, the better accuracy is preserved, while the more regular (structured) the distribution, the better the performance. In our open submission, we continued working on tile-based structured sparsity, which balances accuracy and performance well. Using a sparsity pattern designed around Intel® DL Boost (AVX512-VNNI), we demonstrated the success of INT8 sparse optimization on DLRM.

We trained a sparse DLRM model with zeros distributed in a tiled pattern: tiles of 16 consecutive non-zero elements (shown in red boxes in the accompanying figure), while the rest of the tiles are zeros. In this blog, we call this ‘4x16 tile-based sparsity’; it can fully leverage the AVX512-VNNI capability. More generally, tile-based sparsity consists of blocks of non-zeros arranged in a regular pattern.

	Dense (FP32)	Dense (INT8)	Sparse (INT8)
Accuracy	80.25% (100%)	80.21% (99.96%)	79.91% (99.57%)
Offline QPS	5732	23174 (1.0X)	36883 (1.6X)
Server QPS	NA	20245 (1.0X)	30396 (1.5X)
Table 1: Accuracy and performance comparisons between sparse and dense models

The sparsity ratio of DLRM is 80 percent in geometric mean (geomean), and reaches up to 99 percent for individual General Matrix Multiplications (GEMMs), that is, Linear layers in PyTorch. We developed a sparse GEMM kernel and, using the MLPerf test software, achieved a 1.5-1.6X inference speedup on 3rd Gen Intel® Xeon® Scalable processors (codenamed “Ice Lake”) compared to the same software with the original dense implementation, while keeping the accuracy loss within 0.5 percent, as shown in Table 1.
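
The idea behind a tile-based sparse GEMM can be sketched in a few lines of Python. This is an illustration only, not the kernel from the submission: it assumes that "4x16" means tiles of 4 rows by 16 columns of the weight matrix, and the helper names (`to_tiles`, `sparse_matvec`, `dense_matvec`) are hypothetical. Only non-zero tiles are stored, and the multiplication visits only those tiles, so the work shrinks in proportion to the fraction of tiles that are zero.

```python
TILE_ROWS, TILE_COLS = 4, 16  # assumed interpretation of "4x16 tile-based sparsity"


def to_tiles(weight):
    """Store only the non-zero tiles of a K x N matrix as {(tile_row, tile_col): tile}."""
    k, n = len(weight), len(weight[0])
    assert k % TILE_ROWS == 0 and n % TILE_COLS == 0
    tiles = {}
    for tr in range(k // TILE_ROWS):
        for tc in range(n // TILE_COLS):
            tile = [row[tc * TILE_COLS:(tc + 1) * TILE_COLS]
                    for row in weight[tr * TILE_ROWS:(tr + 1) * TILE_ROWS]]
            if any(v != 0 for row in tile for v in row):
                tiles[(tr, tc)] = tile
    return tiles


def sparse_matvec(x, tiles, n):
    """Compute y = x @ W while touching only the stored (non-zero) tiles."""
    y = [0] * n
    for (tr, tc), tile in tiles.items():
        for r in range(TILE_ROWS):
            xv = x[tr * TILE_ROWS + r]
            for c in range(TILE_COLS):
                y[tc * TILE_COLS + c] += xv * tile[r][c]
    return y


def dense_matvec(x, weight):
    """Reference dense computation of y = x @ W."""
    return [sum(x[k] * weight[k][j] for k in range(len(weight)))
            for j in range(len(weight[0]))]


# An 8 x 32 integer weight matrix with a single non-zero tile out of four.
K, N = 8, 32
weight = [[0] * N for _ in range(K)]
for r in range(4, 8):
    for c in range(16):
        weight[r][c] = (r + c) % 5 - 2

x = list(range(1, K + 1))
tiles = to_tiles(weight)
y_sparse = sparse_matvec(x, tiles, N)
y_dense = dense_matvec(x, weight)
print(len(tiles), y_sparse == y_dense)
```

In a real kernel, each stored tile would map onto VNNI register operations rather than Python loops; the sketch only shows why a regular tile pattern lets the kernel skip zero blocks without per-element bookkeeping.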

2. Optimizations performed with DLRM closed division implementation

Full INT8 model support using Vector Neural Network Instructions (VNNI)
Ops inside the interaction layers are fused into a single interaction kernel, and only the lower-triangular results for the zflat output are computed, to improve operation efficiency
Embedding table lookup and the interaction layer are further fused together to reduce memory movement overhead
The number of instances and the batch size are further tuned, and concatenation overhead is reduced
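
The lower-triangular shortcut in the interaction layer follows from symmetry: the dot product of feature i with feature j equals that of j with i, so only pairs with j < i need to be computed. The following Python sketch illustrates this for a single sample. It is a simplified illustration under the assumption that self-interactions are excluded (strictly lower triangle); the function name `interaction` is hypothetical and the code is not the fused kernel from the submission.

```python
def interaction(dense_out, embeddings):
    """Concatenate the dense feature vector with pairwise dot products.

    dense_out: list of d numbers (output of the bottom MLP for one sample).
    embeddings: list of vectors, each of length d (one per embedding table).
    Only pairs (i, j) with j < i are computed, i.e. the strictly lower triangle.
    """
    feats = [dense_out] + embeddings
    zflat = []
    for i in range(len(feats)):
        for j in range(i):  # j < i: skip the diagonal and the upper triangle
            zflat.append(sum(a * b for a, b in zip(feats[i], feats[j])))
    return dense_out + zflat


dense_out = [1, 2]
embeddings = [[3, 4], [5, 6]]
out = interaction(dense_out, embeddings)
print(out)
```

With n feature vectors, the full pairwise matrix has n*n entries, while the strictly lower triangle has n*(n-1)/2, so less than half of the dot products are computed and written out.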

3. Optimizations performed with BERT closed division implementation

Use pre-compensated quantized tensors, so that asymmetric quantized models run on VNNI with the same performance gain as symmetric ones.
Approximate transcendental functions in the model using 2nd-order polynomials.
Apply Linear GeLU fusion, and fuse the quantization/compensation procedure into the nearest ops.
Compile and export the inference model using the TorchScript infrastructure, a production optimization framework in PyTorch.
Create a multi-threaded, multi-instance SUT in C++ that supports TorchScript models. This avoids Python framework overhead and heavy inter-process communication, and automatically shares model weights among instances.
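
One common way to read "pre-compensated" is the following identity for asymmetric quantization. If activations are stored as unsigned 8-bit values with a zero point z, then sum((x_q - z) * w_q) = sum(x_q * w_q) - z * sum(w_q). The second term depends only on the weights and the zero point, so it can be computed once offline, leaving the inner loop as a plain unsigned-by-signed integer dot product. The Python sketch below illustrates this under that assumption; the function names and the example scales are hypothetical and are not taken from the submission.

```python
def precompute_compensation(w_q_cols, zero_point):
    """Offline step: for each output column, zero_point * sum of quantized weights."""
    return [zero_point * sum(col) for col in w_q_cols]


def linear_int8(x_q, w_q_cols, compensation, x_scale, w_scale):
    """Runtime step: integer dot product, subtract compensation, then rescale."""
    out = []
    for col, comp in zip(w_q_cols, compensation):
        acc = sum(a * b for a, b in zip(x_q, col))  # unsigned 8-bit * signed 8-bit
        out.append((acc - comp) * x_scale * w_scale)
    return out


def linear_reference(x_q, w_q_cols, zero_point, x_scale, w_scale):
    """Reference: subtract the zero point from every activation inside the loop."""
    return [sum((a - zero_point) * b for a, b in zip(x_q, col)) * x_scale * w_scale
            for col in w_q_cols]


zero_point = 128
x_q = [130, 120, 200, 128]                    # asymmetric unsigned 8-bit activations
w_q_cols = [[1, -2, 3, 4], [0, 5, -6, 7]]     # symmetric signed 8-bit weights, by column
x_scale, w_scale = 0.5, 0.25                  # illustrative scales

compensation = precompute_compensation(w_q_cols, zero_point)
y = linear_int8(x_q, w_q_cols, compensation, x_scale, w_scale)
y_ref = linear_reference(x_q, w_q_cols, zero_point, x_scale, w_scale)
print(y, y_ref)
```

Because the compensation is folded into a per-column constant, the asymmetric case costs one extra subtraction per output element instead of one per multiply, which is why it can match the symmetric case in throughput.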

4. Optimizations performed with RNN-T closed division implementation

Enable INT8 LSTM in RNN-T encoder
Improve oneDNN reorder efficiency between FP32 and S8
Improve the RNN-T decoder implementation by using faster PyTorch ops
Improve load balancing and reduce query pile-up by further tuning dynamic batching in the Server scenario
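
Dynamic batching in a Server scenario trades latency for throughput: queries arrive one at a time, and the system must decide how long to wait before running a batch. The post does not describe the policy used in the submission, so the Python sketch below shows only a generic policy with two hypothetical tuning knobs, `max_batch` and `max_wait_ms`: a batch is closed when it is full, or when a new query arrives after the oldest queued query has already waited longer than the limit.

```python
def dynamic_batches(arrivals_ms, max_batch, max_wait_ms):
    """Group sorted arrival times (in ms) into batches.

    A batch is closed when it already holds max_batch queries, or when the next
    arrival comes more than max_wait_ms after the oldest query in the batch.
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and (len(current) == max_batch or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals_ms = [0, 100, 200, 1500, 1600, 1700, 1800, 1900]
batches = dynamic_batches(arrivals_ms, max_batch=4, max_wait_ms=500)
print(batches)
```

Raising `max_batch` or `max_wait_ms` improves hardware utilization but lets queries pile up, while lowering them reduces queueing delay at the cost of smaller, less efficient batches; tuning seeks the point that still meets the latency target.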

All software used is available in the MLPerf results repo, and optimizations are up-streamed to frameworks.

Intel Submission Result for MLPerf v1.1 Inference

Intel submitted data for all datacenter benchmarks and demonstrated leading CPU performance and software efficiency across the data center benchmark suite. See our task and platform coverage in Table 2, and see the complete results of Intel submissions on the MLPerf results page.

Table 2: MLPerf v1.1 inference submission coverage

Thanks to our software optimizations based on the Intel® DL Boost (VNNI/INT8) ISA, and the greater memory capacity and bandwidth of Ice Lake, all the workloads achieved a per-socket performance improvement of 1.5X or more compared to our MLPerf v1.0 submission on 3rd Gen Intel® Xeon® Scalable processors (codenamed “Cooper Lake”).

Chart 1: Per-socket Offline Performance Improvement from MLPerf v1.0 using Cooper Lake to MLPerf v1.1 using Ice Lake

Chart 2: Server Mode Per-socket Performance Improvement from MLPerf v1.0 using Cooper Lake to MLPerf v1.1 using Ice Lake

With software optimizations, we further improved the performance of BERT on Ice Lake with Intel® DL Boost (VNNI/INT8). Queries per second improved by 1.3X in Offline mode and 2.3X in Server mode [f:MLPerf v1.0 Inference Datacenter Closed BERT, RNN-T, entry 1.0-20, 1.0-52 https://mlcommons.org/en/inference-datacenter-10/] [f:MLPerf v1.1 Inference Datacenter Closed BERT, RNN-T, entry 1.1-024, 1.1-026 https://mlcommons.org/en/inference-datacenter-11/] compared to the MLPerf v1.0 submission on the same platform.

Chart 3: BERT-Large Performance Improvement from MLPerf v1.0 to v1.1

On 3rd Gen Intel® Xeon® Scalable processors (codenamed “Ice Lake”), using Intel® Deep Learning Boost (Intel® DL Boost) (VNNI/INT8) and our software optimizations, the open track DLRM submission with sparsity also saw a performance improvement of 5.8X in Offline mode and 5.2X [f:MLPerf v1.0 Inference Datacenter Open DLRM, entry 1.0-67 https://mlcommons.org/en/inference-datacenter-10/] [f:MLPerf v1.1 Inference Datacenter Open DLRM, entry 1.1-130 https://mlcommons.org/en/inference-datacenter-11/] in Server mode compared to the last submission, which used FP32.

Chart 4: DLRM Sparsity Performance Improvement from MLPerf v1.0 to v1.1

On 3rd Gen Intel® Xeon® Scalable processors (codenamed “Cooper Lake”), RNN-T with software optimizations and the Intel® Deep Learning Boost (Intel® DL Boost) (VNNI/INT8) ISA enabled also achieved a performance improvement of 1.86X in Offline mode and 3X in Server mode [f:MLPerf v1.0 Inference Datacenter Closed entry 1.0-20 https://mlcommons.org/en/inference-datacenter-10/] [f:MLPerf v1.1 Inference Datacenter Closed RNN-T, entry 1.1-026 https://mlcommons.org/en/inference-datacenter-11/] compared to the v1.0 BF16 submission.

Chart 5: RNN-T Performance Improvement from MLPerf v1.0 to v1.1

Ease of Use for Intel Submission of MLPerf Inference v1.1 Benchmarks

MLPerf is an industry-standard benchmark for measuring platform AI capabilities across different usage scenarios. To save customers enabling effort, we also provide containers for four PyTorch-based workloads, which help users run the MLPerf benchmarks with minimal effort. Refer to the README.md of each workload at https://github.com/mlcommons/inference_results_v1.1/tree/master/closed/Intel/code to pull the MLPerf Intel containers and evaluate AI workload performance on Intel platforms.

Looking Ahead

Many techniques, including BF16 precision inference and quantization of attention operators in BERT, have been up-streamed into frameworks such as PyTorch and MXNet. This work lets users enjoy the performance boost today without extra code changes. Refer to the code implementation details in the MLCommons™ Inference v1.1 Results GitHub link for all the software optimizations implemented.

We have more exciting AI-focused technologies in the pipeline. Future Intel Xeon Scalable processors, codenamed Sapphire Rapids, will include Intel® Advanced Matrix Extensions (Intel® AMX). We’re also developing a general-purpose GPU optimized for HPC/AI acceleration based on the Intel® Xe architecture, codenamed Ponte Vecchio. We continue to develop the software and hardware to optimize AI performance on Intel® products, empowering enterprises to deliver on the promise of AI.

Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
