Natural Language Processing (NLP) is projected to be a $43 billion market by 2025. Cutting-edge models like Google’s BERT (Bidirectional Encoder Representations from Transformers) are poised to accelerate the adoption of NLP by helping computers understand language more like humans do. At CES 2020, we revealed that our Intel® Nervana™ Neural Network Processor for Inference (NNP-I) runs BERT-base up to 1.6x faster than the Nvidia T4 within a 75W power envelope. By optimizing BERT and other NLP methods on Intel architecture, customers will be able to efficiently deploy new services and products in this quickly growing market.
Background on BERT
First introduced by Google in 2018, BERT is a pre-trained deep learning model that delivers state-of-the-art results on many natural language processing (NLP) tasks like Q&A, sequence tagging, and sentiment extraction (Fig. 1).
Working with BERT consists of three steps:
- Pre-training: This is where the heavy computing occurs. In this step, BERT is trained on a masked language modeling task in which words are randomly hidden or swapped with other random words, and the model learns to recover the originals. Training consumes very large amounts of unlabeled text (e.g., the entire Wikipedia dataset) in a self-supervised manner, so no data labeling is needed. This process is done once to create a model that can then be fine-tuned and used for inference.
- Fine-tuning: This step consists of reasonably fast, supervised learning of a specific task, such as question answering or sentiment analysis, using a small amount of in-domain labeled data.
- Inference: During inference, the fine-tuned model is loaded and prediction is invoked. Because the fine-tuned model retains the extremely large pre-trained network (BERT-base contains 110 million parameters), each prediction requires a very large feed-forward calculation, which is computationally intensive compared to traditional supervised-learning inference.
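The pre-training step above corrupts input text before asking the model to reconstruct it. As a minimal sketch, the following toy masking routine follows the 80/10/10 split from the original BERT recipe; the vocabulary and token names here are illustrative stand-ins, not a real tokenizer:

```python
import random

# Toy stand-ins for a real tokenizer's mask token and vocabulary.
MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Select ~15% of positions; of those, mask 80%, swap 10%, keep 10%."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: leave the token unchanged
    return corrupted, labels

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = mask_tokens(tokens, rng=rng)
```

Positions where `labels` is `None` were never selected and pass through unchanged; the model is only scored on the selected positions.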
BERT Performance on Intel Nervana NNP-I
A major factor in the 1.6x performance gain over the Nvidia T4 is the combination of the NNP-I hardware architecture and the 8-bit quantization scheme (Q8BERT) that Intel researchers presented last fall, which achieves a best-in-class accuracy-compression ratio for BERT-base. The recipe is available in Intel’s NLP Architect library, which builds on Hugging Face’s Transformers API.
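To make the 8-bit idea concrete, here is a minimal sketch of symmetric linear quantization, the general technique behind schemes like Q8BERT. The per-tensor scale choice (derived from the maximum absolute value) is a common convention used here for illustration, not necessarily the exact recipe in NLP Architect:

```python
def quantize_int8(values, scale):
    """Symmetric linear quantization: x_q = clamp(round(x / scale), -127, 127)."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize_int8(q_values, scale):
    """Approximate reconstruction: x ~= x_q * scale."""
    return [q * scale for q in q_values]

# Hypothetical FP32 weights for illustration.
weights = [0.42, -1.27, 0.05, 0.9, -0.33]
scale = max(abs(w) for w in weights) / 127  # per-tensor scale from max |w|
q = quantize_int8(weights, scale)
recovered = dequantize_int8(q, scale)
```

Each quantized value fits in a signed byte, and the reconstruction error per element is bounded by half the scale, which is why an appropriately calibrated scale preserves accuracy.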
For this benchmark comparison, we optimized the BERT implementation for maximum throughput, independent of batch size and latency, within a similar sub-75W power envelope and PCIe form factor. According to Nvidia’s published product performance as of February 3, 2020, the T4 reaches a maximum throughput of 827 sentences/second at batch size 8. At CES 2020 we presented the Intel Nervana NNP-I’s maximum throughput of 1334 sentences/second. As of January 20, 2020, additional software optimizations raised that throughput to 1560 sentences/second.
Though already best-in-class, we expect the Intel Nervana NNP-I’s throughput performance on NLP tasks to continue improving as the software stack matures and optimizations continue.
Intel Nervana NNP-I + BERT
The Intel Nervana NNP-I has twelve Inference Compute Engines (ICEs). Each ICE contains a high-performance matrix multiplication engine that supports 8-bit quantized and FP16 precision, plus a highly capable Tensilica Vision Q6 digital signal processor (DSP) with FP16 vector processing units (VPUs) and large local SRAMs. This allows BERT execution to be mapped entirely onto the ICE cores. The matrix engine achieves up to 92 TOPS and runs the 8-bit MLP layers.
The Tensilica DSP is a highly programmable, high-performance 512-bit VLIW vector machine that performs the elementwise, softmax, layer-normalization, and transpose layers in FP16 precision. The combination of quantizing the MLPs to 8 bits and the non-GEMM operations to FP16 enables very high performance on the Intel Nervana NNP-I.
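The mixed-precision split described above can be sketched in a few lines: the GEMM runs on quantized integers with wide accumulation, and a non-GEMM operation (softmax here) runs in floating point afterward. This is an illustrative pure-Python model of the dataflow, not the actual NNP-I kernels; the operand values and scales are hypothetical:

```python
import math

def int8_matmul(a_q, b_q):
    """Integer GEMM: int8-range inputs, wide (int32-style) accumulation."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    return [[sum(a_q[i][k] * b_q[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def softmax(row):
    """Non-GEMM op kept in floating point (FP16 on the hardware)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

# Toy quantized operands with hypothetical per-tensor scales.
a_q, a_scale = [[10, -3], [7, 2]], 0.05
b_q, b_scale = [[4, 1], [-2, 6]], 0.02
acc = int8_matmul(a_q, b_q)
# Dequantize the integer accumulator with the product of the input scales,
# then hand the result to the floating-point softmax stage.
logits = [[v * a_scale * b_scale for v in row] for row in acc]
probs = [softmax(row) for row in logits]
```

The key point the sketch captures is that the expensive inner products stay entirely in integer arithmetic; floating point appears only for the cheap, numerically sensitive layers.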
In addition, the large SRAM inside each ICE enables all intermediate results (between layers) to be stored inside the IP, maintaining data locality and proximity to the execution units. This reduces external bandwidth and conserves power. The 24MB last-level cache (LLC) shared across all ICE cores also considerably reduces the bandwidth required to fetch parameters from memory.
For more details on the Intel Nervana NNP-I solution, see our presentation from Hot Chips 2019: “Spring Hill (NNP-I 1000): Intel’s Data Center Inference Chip.”
Batch 2×6 Run on Intel Nervana NNP-I
BERT and similar workloads can be compiled and run on the Intel Nervana NNP-I in different modes that target best latency, best throughput, or maximal throughput at a given latency.
In this blog, the BERT measurement refers to the 2×6 throughput mode illustrated in Figure 3. In this mode, each batch of two sentences runs on a pair of Inference Compute Engine (ICE) cores. Since the Intel Nervana NNP-I has 12 ICE cores, six parallel, asynchronous batch-2 inferences can run on the machine simultaneously. Each pair of ICE cores runs one batch-2 inference; this mode provides very high throughput while allowing the software to run at a very low batch size.
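The 2×6 scheduling idea can be sketched as six independent workers, each consuming batch-2 requests concurrently. The `run_inference` function below is a placeholder for the compiled BERT graph, and the thread pool stands in for the six hardware instances; this models only the dispatch pattern, not the accelerator itself:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model of the 2x6 mode: 12 ICE cores grouped into pairs,
# giving 6 instances that each run batch-2 inferences in parallel.
NUM_ICE, ICE_PER_INSTANCE, BATCH = 12, 2, 2
NUM_INSTANCES = NUM_ICE // ICE_PER_INSTANCE  # 6 parallel instances

def run_inference(instance_id, batch):
    # Placeholder: a real instance would execute BERT on its two ICE cores.
    return [f"instance-{instance_id}:pred-for-{s}" for s in batch]

sentences = [f"sentence-{i}" for i in range(NUM_INSTANCES * BATCH)]
batches = [sentences[i:i + BATCH] for i in range(0, len(sentences), BATCH)]

# Each batch of two is dispatched to one instance; all six run concurrently.
with ThreadPoolExecutor(max_workers=NUM_INSTANCES) as pool:
    results = list(pool.map(run_inference, range(NUM_INSTANCES), batches))
```

Because the six instances are asynchronous, aggregate throughput scales with the instance count while per-request latency stays that of a single batch-2 inference.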
Continued Improvements for NLP
Though a relatively new method, BERT is already being used in a variety of tasks. Google utilizes BERT in its core search and ranking algorithms to better understand the subtle meanings of words and phrases in searches and to match queries with relevant results. Researchers have published a paper on aspect-based sentiment analysis using BERT, while others have proposed using BERT to generate multiple-choice questions or in quantitative trading algorithms. Because BERT has shown state-of-the-art results in a wide variety of areas, including Q&A, named entity recognition, classification, and more, we expect it to be a crucial component of future NLP tasks.
Intel Nervana NNP-I System Configuration:
PCIe card measurements are based on projections of single-chip pre-production NNP-I. Host system: Intel Xeon Silver 4116 CPU @ 2.10GHz. Total memory: 256GB. Tested by Intel as of January 3 and January 20, 2020. Workload: BERT-Base. Data set: MRPC validation set. PyTorch: 1.1. Sequence length: 128. Batch size: 2, 12 instances. Throughput: 1560 sentences/sec. Power projections of PCIe card: 75W. Precision: mixed (8-bit/FP16). Nvidia T4 published results as of January 30, 2020.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Performance results are based on testing as of January 2020 and may not reflect all publicly available security updates. No product or component can be absolutely secure.
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
© 2020 Intel Corporation