Q8BERT, a Quantized 8bit Version of BERT-Base


Pre-trained transformer language models (GPT, XLNet, XLM, BERT) have demonstrated state-of-the-art (SOTA) results on a variety of Natural Language Processing (NLP) tasks such as sentence classification and sequence tagging, either by extracting contextual word representations or by fine-tuning the whole model on a target task. These models are pre-trained on extremely large corpora and contain a huge number of parameters. This development will have a major impact on the way business organizations consume computing resources, since inference hardware will have to load large models and perform heavy feed-forward calculations. It will also shift the workload focus from lower-level training to more application-specific tweaking. It is therefore important to develop energy-efficient, low-cost methods to run these models in production.

The authors of the BERT (Bidirectional Encoder Representations from Transformers) language model published two pre-trained models along with their paper: BERT-Base, which has 110M parameters in FP32 representation, and BERT-Large, which has 334M parameters in FP32 representation. Both BERT models have a high memory footprint and require heavy compute during inference. In addition, real-time NLP applications that integrate BERT have to meet low latency requirements to achieve a high quality customer experience. The computational characteristics of BERT make it challenging to deploy in production environments, and recently several methods such as quantization, weight pruning, and model distillation have been proposed to run BERT inference efficiently.

In this work, we present a method to achieve a best-in-class compression-accuracy ratio for BERT-Base. To do this, we applied quantization-aware training during the fine-tuning process of BERT. We quantized all GEMM (General Matrix Multiply) operations in BERT's fully connected layers and simulated 8bit quantized inference with FP32 variables, retaining 99% of the accuracy of the FP32 version of BERT-Base on eight different NLP tasks. The FP32 variables can easily be converted to an 8bit representation, which reduces the BERT memory footprint by approximately 4x and lowers memory bandwidth during inference. In addition, our method can be used to implement efficient inference on hardware that supports 8bit arithmetic together with an optimized library for 8bit GEMM. We open sourced the quantization method and the code for reproducing the simulated 8bit quantized models in NLP Architect release 0.5.
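As a rough illustration of where the "approximately 4x" figure comes from, the back-of-the-envelope calculation below compares 4 bytes per parameter (FP32) with 1 byte per parameter (8bit) for the 110M parameters of BERT-Base. It is only a sketch: the helper name `model_size_mb` is hypothetical, and the calculation assumes every parameter is stored in 8 bits, ignoring scale factors and any values kept at higher precision, which is why the real saving is approximate.

```python
def model_size_mb(num_params, bytes_per_param):
    """Storage needed for the parameters alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


BERT_BASE_PARAMS = 110_000_000  # parameter count quoted for BERT-Base

fp32_mb = model_size_mb(BERT_BASE_PARAMS, 4)  # FP32: 4 bytes per parameter
int8_mb = model_size_mb(BERT_BASE_PARAMS, 1)  # 8bit: 1 byte per parameter
ratio = fp32_mb / int8_mb

print(f"FP32: {fp32_mb:.0f} MB, 8bit: {int8_mb:.0f} MB, ratio: {ratio:.1f}x")
```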

How we applied Quantization Aware Training to BERT

We used linear symmetric quantization as our scheme for both activations and weights, based on the method proposed by Jacob et al.:

$$\mathrm{Quantize}(x, \mathrm{scale}, \mathrm{bits}) = \mathrm{Clip}\left(\mathrm{Round}(x \cdot \mathrm{scale}),\ -(2^{\mathrm{bits}-1} - 1),\ 2^{\mathrm{bits}-1} - 1\right)$$

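To make the quantization function concrete, here is a minimal Python sketch of it applied to a handful of example values. The `quantize` function follows the formula above; the `symmetric_scale` helper, which maps the largest absolute value onto the top of the integer range, is an assumption about how the scale is chosen and is not stated in this post. The sample weights are made up for illustration.

```python
def quantize(x, scale, bits=8):
    """Linear symmetric quantization: round x * scale, then clip to the signed range."""
    limit = 2 ** (bits - 1) - 1  # 127 for 8 bits
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return max(-limit, min(limit, round(x * scale)))


def symmetric_scale(max_abs, bits=8):
    """Assumed scale choice: map the largest magnitude onto the largest integer."""
    return (2 ** (bits - 1) - 1) / max_abs


weights = [0.5, -1.2, 0.25, 2.0]  # illustrative values only
scale = symmetric_scale(max(abs(w) for w in weights))  # 127 / 2.0 = 63.5
quantized = [quantize(w, scale) for w in weights]

print(quantized)             # [32, -76, 16, 127]
print(quantize(3.0, scale))  # 127: values outside the range are clipped
```

Because the range is symmetric around zero, the most negative 8bit value (-128) is never used.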
During the fine-tuning phase, we applied fake quantization to the weights and activations in order to simulate the error induced by quantization in the forward pass. When performing back-propagation, we estimated the gradients using a Straight-Through Estimator (STE). Furthermore, we learned the quantization range of the activations while fine-tuning by collecting an exponential moving average of that range.
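The three ingredients of this paragraph can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the NLP Architect implementation: the names `fake_quantize`, `ste_backward`, and `EmaRange` are hypothetical, the momentum of 0.9 is an arbitrary choice, and the STE is shown in its simplest form, where the gradient of the rounding and clipping step is treated as 1.

```python
def fake_quantize(x, scale, bits=8):
    """Quantize then immediately dequantize, so the value stays a float
    but carries the rounding and clipping error of 8bit inference."""
    limit = 2 ** (bits - 1) - 1
    q = max(-limit, min(limit, round(x * scale)))
    return q / scale


def ste_backward(grad_output):
    """Straight-Through Estimator (simplest form): pass the gradient through
    the non-differentiable fake-quantization step unchanged."""
    return grad_output


class EmaRange:
    """Tracks an exponential moving average of the largest absolute activation."""

    def __init__(self, momentum=0.9):  # momentum value is an assumption
        self.momentum = momentum
        self.value = None

    def update(self, batch):
        batch_max = max(abs(v) for v in batch)
        if self.value is None:
            self.value = batch_max  # first batch initializes the average
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * batch_max
        return self.value

    def scale(self, bits=8):
        return (2 ** (bits - 1) - 1) / self.value


tracker = EmaRange()
tracker.update([0.5, -2.0, 1.0])  # range starts at 2.0
tracker.update([4.0, -1.0])       # 0.9 * 2.0 + 0.1 * 4.0 = 2.2
act_scale = tracker.scale()

simulated = fake_quantize(1.0, act_scale)  # close to 1.0, off by at most 0.5 / act_scale
clipped = fake_quantize(10.0, act_scale)   # saturates at the tracked range, about 2.2
```

At inference time the tracked range is frozen, and the scale derived from it is used to quantize the activations.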

When running inference, we quantized the activations and weights to Int8 and the biases to Int32 according to the data we collected while training. However, we represented the 8bit values in FP32 variables. As a result of our quantization method, all GEMM operations can be done in integer arithmetic with 32bit accumulators and then re-quantized back to Int8 values. We noticed that most of the GEMM operations are followed by operations that require high precision, such as layer normalization and Softmax. Therefore, we removed the requantization step after the GEMM in order to avoid further precision loss.
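The inference path described above can be sketched for a single fully connected layer as follows. This is a simplified pure-Python illustration with made-up inputs, not the actual kernel: the function names are hypothetical, and the choice to quantize the bias with the product of the activation and weight scales is an assumption that lets it be added directly to the integer accumulator. Python integers are unbounded, so the 32bit accumulator is only indicated in the comments.

```python
def quantize_vector(values, scale, bits=8):
    """Quantize a list of floats to signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    limit = 2 ** (bits - 1) - 1
    return [max(-limit, min(limit, round(v * scale))) for v in values]


def int_gemm_dequantized(x, w_rows, bias, scale_x, scale_w):
    """Compute y = W x + b with integer multiply-accumulate, then return floats.

    The result is dequantized straight from the accumulator instead of being
    requantized to Int8, mirroring the removal of the requantization step.
    """
    x_q = quantize_vector(x, scale_x)
    acc_scale = scale_x * scale_w  # scale of the integer accumulator
    outputs = []
    for row, b in zip(w_rows, bias):
        w_q = quantize_vector(row, scale_w)
        b_q = round(b * acc_scale)  # bias as a wide integer (Int32 on real hardware)
        acc = b_q + sum(xi * wi for xi, wi in zip(x_q, w_q))  # Int32 accumulator
        outputs.append(acc / acc_scale)
    return outputs


x = [1.0, -0.5, 0.25]                        # illustrative activations
w = [[0.2, -0.4, 0.1], [0.3, 0.3, -0.6]]     # illustrative weights, one row per output
b = [0.05, -0.1]

scale_x = 127 / max(abs(v) for v in x)
scale_w = 127 / max(abs(v) for row in w for v in row)

quantized_out = int_gemm_dequantized(x, w, b, scale_x, scale_w)
float_out = [sum(xi * wi for xi, wi in zip(x, row)) + bi for row, bi in zip(w, b)]
```

Comparing `quantized_out` with `float_out` shows that the integer path tracks the floating-point result closely for these inputs.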


Our code is available in NLP Architect release 0.5, which integrates Hugging Face's PyTorch-Transformers model repository. It includes several NLP tasks for ease of model training and inference. We expanded on those transformer models and added the quantized BERT-Base model. To achieve that, we replaced all linear and embedding layers of BERT with our own implementation of quantized layers. Using this approach, our quantized model can be used for training (fine-tuning) on any task with both the BERT-Base and BERT-Large pre-trained models. Please note that our 8bit quantization "recipe" is not limited to the BERT example and could be applied easily via NLP Architect to any other transformer model supported by Hugging Face's API.


To test our approach, we evaluated our model on the GLUE (General Language Understanding Evaluation) benchmark, which is a collection of resources for training, evaluating, and analyzing natural language understanding systems across a wide array of NLP tasks. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems. In addition, we evaluated our model on the question answering task SQuADv1.1. The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

We have summarized our results for quantized BERT in the following table. We ran each experiment five times and reported the average result and standard deviation. In all experiments we used BERT-Base as the base model unless indicated otherwise, and we fine-tuned the pre-trained model offered by TensorFlow Hub. In our internal testing, we found that the relative error induced by quantization is less than 1% (excluding the RTE task) while the size of the model is reduced by approximately 4x.

| Dataset | Metric | BERT baseline (STD) | Quantized BERT 8bit (STD) | Relative reduction |
|---|---|---|---|---|
| CoLA* | Matthew's corr. | 58.48 (1.54) | 58.48 (1.32) | 0.00% |
| MRPC | F1 | 90 (0.23) | 89.56 (0.18) | 0.49% |
| MRPC-Large | F1 | 90.86 (0.55) | 90.9 (0.29) | -0.04% |
| QNLI | Accuracy | 90.3 (0.44) | 90.62 (0.29) | -0.35% |
| QNLI-Large | Accuracy | 91.66 (0.15) | 91.74 (0.36) | -0.09% |
| QQP | F1 | 87.84 (0.19) | 87.96 (0.35) | -0.14% |
| RTE* | Accuracy | 69.7 (1.5) | 68.78 (3.52) | 1.32% |
| SST-2 | Accuracy | 92.36 (0.59) | 92.24 (0.27) | 0.13% |
| STS-B | Pearson corr. | 89.62 (0.31) | 89.04 (0.17) | 0.65% |
| STS-B-Large | Pearson corr. | 90.34 (0.21) | 90.12 (0.13) | 0.24% |
| SQuADv1.1 | F1 | 88.46 (0.15) | 87.74 (0.15) | 0.81% |

* These tasks produce results with high variance in both the baseline and the 8bit experiments.

** "Large" indicates tasks that were trained with the BERT-Large architecture.
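The last column of the table can be reproduced from the two score columns. The post does not spell out the formula, so the sketch below assumes the relative reduction is the drop from the baseline score divided by the baseline score, expressed as a percentage; this assumption matches the reported figures for the rows shown. A negative value means the quantized model scored slightly higher than the baseline.

```python
def relative_reduction(baseline, quantized):
    """Assumed definition: percentage drop of the quantized score relative to the baseline."""
    return (baseline - quantized) / baseline * 100


# (baseline, quantized) scores copied from the table above
scores = {
    "MRPC": (90.0, 89.56),
    "QNLI": (90.3, 90.62),
    "RTE": (69.7, 68.78),
    "SQuADv1.1": (88.46, 87.74),
}

reductions = {task: round(relative_reduction(b, q), 2) for task, (b, q) in scores.items()}

for task, value in reductions.items():
    print(f"{task}: {value:.2f}%")
```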

Summary and future work:

We have shown a method for quantizing BERT GEMM operations to 8bit for a variety of NLP tasks with minimal loss in accuracy, and we hope that the software developer community can use our quantization method to compress BERT and implement efficient BERT inference with 8bit GEMM operations. Efficient inference will enable low-latency NLP applications on a variety of hardware platforms, from devices to data centers. We intend to apply additional software and hardware co-design compression methods to BERT in order to further accelerate BERT inference. Developers and researchers can visit the NLP Architect website to explore our new features for NLP optimization in production, and follow us on Twitter for the latest updates from the Intel AI Lab.

We're also excited to announce that our paper on quantized 8bit BERT has been accepted to the Energy Efficient Machine Learning and Cognitive Computing workshop on Friday, December 13th, co-located with the NeurIPS conference in Vancouver, BC.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Configurations: CPU: 2x Intel® Xeon® Processor E5-2699A v4 @ 2.40GHz; RAM: 251GB System memory; GPU: Nvidia Titan XP; OS: Ubuntu 16.04.1 (4.15.0-50-generic); Software: PyTorch 1.2.0, NLP Architect 0.5.1.

Performance results are based on testing as of September 2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others. © Intel Corporation
