Alibaba Group: Zhao Jiang, Senior Engineer; Li Ding, Engineer; Hao Liang, Senior Engineer
Intel Corporation: Pujiang He, Software Architect; Changqing Li, Cloud Software Engineer 


BERT [1] is a key model of Alibaba Cloud Platform for Artificial Intelligence (AI). It is widely used in natural language processing (NLP) tasks across different AI-related services, and Alibaba Cloud wants to lower its latency to achieve a better user experience. With the performance advantage of Intel® Deep Learning Boost (Intel® DL Boost) and its new brain floating point (bfloat16, BF16) capabilities [2] on 3rd Gen Intel® Xeon® Scalable processors, Alibaba Cloud can optimize inference performance on top AI models for a better service experience and lower total cost of ownership (TCO).

The Alibaba Cloud team worked closely with Intel engineers on the BERT inference optimization. Before this BF16 work, we had already optimized the FP32 BERT model by fusing several BERT layers into one large operator. We therefore use that optimized FP32 solution as the performance baseline and focus here on enabling BF16. Profiling the optimized FP32 solution showed that over 80% of the hot-function time is spent in FP32 MatMul, and the same pattern appears on Alibaba Cloud. This made it a great candidate for investigating the applicability of Intel DL Boost with BF16, which can balance minimal accuracy loss with much higher throughput. We replaced FP32 MatMul with BF16 MatMul and leveraged oneAPI Deep Neural Network Library (oneDNN) 1.3 on the 3rd Gen Intel Xeon Scalable processor with Intel DL Boost to achieve a 1.83x gain with the BF16 solution. In addition, the BF16 solution achieved the same accuracy as the FP32 solution on the MRPC dataset for the classification task (both 83.59% on a proxy model).

Baseline: Optimized Float32 BERT Model

Here is a quick look at the current performance baseline, the optimized float32 (FP32) solution. The Alibaba Cloud BERT model is based on the 12-layer BERT-Base model from google-research/bert [3], and we selected BERT-Base, Uncased (12-layer, 768-hidden, 12-heads, 110M parameters) as the aligned model.

As shown in Figure 1, to boost BERT model performance we first used “Model Transformation” to replace the complex multi-layer, multi-op BERT model with a single BERT op. The fused part accounts for most of the total runtime, and it is common to standard BERT variants, so the approach also scales to other BERT models built on the same backbone network. Second, to implement a highly efficient BERT op on TensorFlow/TensorFlow Serving, we created an optimized kernel: a new custom BERT op implemented in C++ with op fusion and optimization.

Figure 1. BERT model optimization solution

In TensorFlow, the “front-end” is responsible for describing the graph and the “back-end” is responsible for executing operators. Thus, we can transform the original BERT model into a new one that uses the BERT op on the front-end, and register the new BERT kernel implementation with the back-end. As a result, we don't need to reinstall the TensorFlow framework. We only need to load the dynamic library that implements the BERT op, which lets us respond to customers quickly, and we boost performance as much as possible on the 3rd Gen Intel Xeon Scalable processor by using high-performance libraries such as Intel® Math Kernel Library (Intel® MKL) and oneAPI Deep Neural Network Library (oneDNN).
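The front-end/back-end split described above can be illustrated with a toy sketch in plain Python. It deliberately does not use the TensorFlow API: the op names (`Scale`, `Shift`, `FusedScaleShift`), the `KERNELS` registry, and the helpers `register_kernel`, `fuse_subgraph`, and `run` are all hypothetical stand-ins chosen for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List

# Hypothetical "back-end": maps an op name to the function that executes it.
KERNELS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register_kernel(op_name: str):
    """Register a kernel implementation under an op name."""
    def decorator(fn):
        KERNELS[op_name] = fn
        return fn
    return decorator


def fuse_subgraph(graph: List[str], pattern: List[str], fused_op: str) -> List[str]:
    """Hypothetical "front-end" rewrite: replace each occurrence of
    `pattern` in a linear op list with a single fused op."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    out, i = [], 0
    while i < len(graph):
        if graph[i:i + len(pattern)] == pattern:
            out.append(fused_op)
            i += len(pattern)
        else:
            out.append(graph[i])
            i += 1
    return out


@register_kernel("Scale")
def scale(x):
    return [2.0 * v for v in x]


@register_kernel("Shift")
def shift(x):
    return [v + 1.0 for v in x]


@register_kernel("FusedScaleShift")
def fused_scale_shift(x):
    # One pass over the data instead of two.
    return [2.0 * v + 1.0 for v in x]


def run(graph: List[str], x: List[float]) -> List[float]:
    """Execute a linear graph by looking up each op's kernel."""
    for op in graph:
        x = KERNELS[op](x)
    return x


original = ["Scale", "Shift", "Scale", "Shift"]
transformed = fuse_subgraph(original, ["Scale", "Shift"], "FusedScaleShift")
```

Both graphs compute the same result; only the graph description and the registered kernel change, which mirrors why the framework itself does not need to be rebuilt.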

To support the new op in TensorFlow, we analyzed the BERT graph for layer fusion and tensor fusion. Figure 2 shows the details of one layer of the 12-layer BERT model. The three tensors (query, key, and value) are fused into one op, QKV MatMul & BiasAdd. We also eliminate the transpose op by using strided MatMul, which reduces memory access. For the MatMul and BatchMatMul ops in the optimized FP32 solution, we used the SGEMM function from MKL, and the resulting optimized FP32 BERT model performance serves as the baseline.
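The query/key/value fusion can be sketched as follows. This is a minimal illustration in plain Python, assuming the fusion works by concatenating the three weight matrices and biases column-wise so that one MatMul and one BiasAdd replace three of each; the function names and the tiny 2x2 shapes are hypothetical, and the real op is a C++ kernel.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive (rows of a) x (columns of b) matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bias_add(m: Matrix, bias: List[float]) -> Matrix:
    """Add a bias vector to every row."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def fuse_qkv_params(wq, wk, wv, bq, bk, bv) -> Tuple[Matrix, List[float]]:
    """Concatenate the Q, K, V weights and biases along the output dimension."""
    w_qkv = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
    b_qkv = bq + bk + bv
    return w_qkv, b_qkv


def qkv_fused(x: Matrix, w_qkv: Matrix, b_qkv: List[float], hidden: int):
    """One MatMul + BiasAdd, then slice the result into Q, K and V."""
    out = bias_add(matmul(x, w_qkv), b_qkv)
    q = [row[:hidden] for row in out]
    k = [row[hidden:2 * hidden] for row in out]
    v = [row[2 * hidden:] for row in out]
    return q, k, v


def qkv_separate(x, wq, wk, wv, bq, bk, bv):
    """Reference: three independent MatMul + BiasAdd ops."""
    return (bias_add(matmul(x, wq), bq),
            bias_add(matmul(x, wk), bk),
            bias_add(matmul(x, wv), bv))


# Tiny example with hidden size 2 and integer-valued floats (exact arithmetic).
x = [[1.0, 2.0], [3.0, 4.0]]
wq, wk, wv = [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]
bq, bk, bv = [1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]

w_qkv, b_qkv = fuse_qkv_params(wq, wk, wv, bq, bk, bv)
print(qkv_fused(x, w_qkv, b_qkv, hidden=2) == qkv_separate(x, wq, wk, wv, bq, bk, bv))  # prints True
```

The fused form reads the input once and issues one larger MatMul, which is generally friendlier to an optimized GEMM routine than three small ones.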

Figure 2. New BERT op implementation

Bfloat16 BERT Model Optimization Solution

Starting from the optimized FP32 BERT solution above as the baseline, we observed that BERT's model parameters are large and remain fixed during inference. As shown in Figure 3, the FP32 MatMul data flow graph of BERT has FP32 weights and FP32 inputs. When we profiled the optimized FP32 BERT model, we found that over 80% of the inference runtime is spent in MatMul ops. Optimizing the MatMul op to reduce latency therefore became one of the most pressing challenges.

Figure 3. FP32 MatMul data flow graph of BERT

As we know, reducing memory access, optimizing cache usage, and improving parallelism can all improve program performance. With the 3rd Gen Intel Xeon Scalable processors and the Intel DL Boost bfloat16 instructions shown in Table 1 below, the BF16 dot product can theoretically be accelerated by 2x compared to the FP32 dot product [2]. With these instructions, we can convert one or two packed float numbers to one packed bfloat16 number, and calculate the dot product of two bfloat16 pairs and accumulate the result into one float number.

Table 1. Intel DL Boost bfloat16 instructions

VCVTNE2PS2BF16: Convert two packed single precision numbers to one packed Bfloat16 number
VCVTNEPS2BF16: Convert one packed single precision number to one packed Bfloat16 number
VDPBF16PS: Calculate dot product of two Bfloat16 pairs and accumulate the result into one packed single precision number
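To make the conversion and the paired dot product concrete, here is a scalar emulation in plain Python. It is only a sketch under stated assumptions: BF16 is treated as the upper 16 bits of an FP32 value with round-to-nearest-even, NaN is mapped to a fixed quiet NaN pattern, and the accumulation is rounded to FP32 after each pair. It does not claim to reproduce the hardware's exact rounding or denormal behavior, and the function names are hypothetical.

```python
import struct
from typing import List


def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def round_f32(x: float) -> float:
    """Round a Python float (double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_to_bf16(x: float) -> int:
    """Convert FP32 to a 16-bit BF16 pattern (round to nearest, ties to even)."""
    if x != x:                      # NaN: return a quiet NaN pattern
        return 0x7FC0
    bits = f32_bits(x)
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_to_f32(h: int) -> float:
    """Widen a BF16 pattern to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def dot_bf16(a: List[int], b: List[int], acc: float = 0.0) -> float:
    """Multiply BF16 elements two at a time and accumulate into an FP32 value."""
    for i in range(0, len(a), 2):
        pair_sum = sum(bf16_to_f32(x) * bf16_to_f32(y)
                       for x, y in zip(a[i:i + 2], b[i:i + 2]))
        acc = round_f32(acc + pair_sum)
    return acc


a = [f32_to_bf16(v) for v in [1.0, 2.0, 0.5, 4.0]]
b = [f32_to_bf16(v) for v in [3.0, 4.0, 2.0, 0.25]]
print(dot_bf16(a, b))  # prints 13.0
```

Because BF16 keeps the FP32 exponent and only shortens the mantissa, the conversion is a rounding of the bit pattern rather than a rescaling, which is why it is cheap to apply to inputs on every execution.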

To reduce memory access, we convert the FP32 weights to BF16 weights, as shown in Figure 4. The FP32 weights are converted to BF16 once and cached in the BF16 MatMul op for reuse, while the FP32 inputs are converted to BF16 in parallel on every execution. The MatMul op is then computed with the bfloat16 dot product, taking BF16 inputs and producing FP32 output. These implementations are all supported by oneDNN. So we only need to create a new BF16 MatMul op to replace the MatMul op of the optimized FP32 solution (the baseline), and we gain performance compared to the FP32 optimization.

Figure 4. BF16 MatMul data flow graph of BERT

In addition, because the BF16 solution improves performance through a simple op replacement, we can keep accuracy as high as possible [4]. For the BiasAdd op, we keep the FP32 operation to reduce accuracy loss.
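The BF16 MatMul data flow described above (weights converted once and cached, inputs converted on every call, FP32 accumulation, BiasAdd kept in FP32) can be sketched in plain Python. This is an illustrative emulation, not the oneDNN-based op: the class name `Bf16MatMul`, its attributes, and the scalar loops are hypothetical, and BF16 values are represented as Python floats rounded to BF16 precision.

```python
import struct
from typing import List, Optional

Matrix = List[List[float]]


def round_f32(x: float) -> float:
    """Round a Python float (double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_bf16(x: float) -> float:
    """Round to BF16 precision (nearest, ties to even); NaN is not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


class Bf16MatMul:
    """Emulated MatMul with cached BF16 weights and an optional FP32 bias."""

    def __init__(self, weights_fp32: Matrix, bias_fp32: Optional[List[float]] = None):
        # Weights are fixed during inference: convert once and keep the
        # BF16 copy, stored column-wise for the dot products below.
        weights_bf16 = [[round_bf16(w) for w in row] for row in weights_fp32]
        self.weight_cols_bf16 = [list(col) for col in zip(*weights_bf16)]
        self.bias_fp32 = bias_fp32

    def __call__(self, inputs_fp32: Matrix) -> Matrix:
        # Inputs change on every call, so they are converted each time.
        inputs_bf16 = [[round_bf16(v) for v in row] for row in inputs_fp32]
        out = []
        for row in inputs_bf16:
            out_row = []
            for j, col in enumerate(self.weight_cols_bf16):
                acc = 0.0
                for a, b in zip(row, col):
                    acc = round_f32(acc + a * b)   # FP32 accumulation
                if self.bias_fp32 is not None:
                    acc = round_f32(acc + self.bias_fp32[j])  # BiasAdd stays FP32
                out_row.append(acc)
            out.append(out_row)
        return out


op = Bf16MatMul([[1.0, 2.0], [3.0, 4.0]], bias_fp32=[0.5, 0.5])
print(op([[1.0, 1.0]]))  # prints [[4.5, 6.5]]
```

Keeping the bias in FP32 means its low-order mantissa bits are not rounded away before the addition, which is one plausible reading of why the BiasAdd op was left unchanged.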

Performance and Accuracy Validation

To compare the three solutions, namely unoptimized TensorFlow v1.14 (Eigen), optimized FP32, and optimized BF16, we tested them on the same 3rd Gen Intel Xeon Scalable processors, as shown in Tables 2 and 3 below. “Default” refers to the unoptimized TensorFlow v1.14 (Eigen) solution.

Table 2. Hardware Configuration (Default, FP32 & BF16 Configuration)

# Nodes
# Sockets: 4
3rd Gen Intel Xeon Scalable processors
Cores/socket, Threads/socket
BIOS version: WCCCPX6.RPB.0018.2020.0410.1316
System DDR Mem Config (slots / cap / run-speed): 24 slots / 16GB / 2933
Total Memory/Node (DDR+DCPMM): 384 GB
Storage - boot
Storage - application drives
2x Ethernet Controller 10G X550T
CentOS 8.1



Table 3. Software Configuration (Default, FP32 & BF16 configurations)

TensorFlow v1.14.0
Customized BERT from Ali
gcc 8.3.1
Libraries (incl. version) e.g MKL DNN, or DAAL: Eigen 3.3, MKL 2020.1.217 vs oneDNN 1.3
Dataset (size, shape)
Precision (FP32, INT8, BF16): FP32 vs BF16
granularity=fine, compact, 1, 0
0-23, 24-47, 48-71, 72-95



To compare the performance of the optimized FP32 BERT and the optimized BF16 BERT, we set the batch size to 1 and the token size to 128, which also aligns with Alibaba Cloud’s online business. To achieve the lowest latency, we ran each BERT model instance on a single socket with 24 cores, using inputs from the MRPC dataset on TensorFlow v1.14. The optimized FP32 solution, at 21.70 ms latency, is the baseline. The optimized BF16 solution reached 11.83 ms latency, a 1.83x boost over the optimized FP32 solution, as shown in Table 4. Note that these are end-to-end measurements over the whole model.
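As a quick check of the arithmetic, the speedup follows directly from the two reported latencies. The helper names below are hypothetical; the latency values are the ones quoted above.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def latency_reduction(baseline_ms: float, optimized_ms: float) -> float:
    """Fraction of the baseline latency that was removed."""
    return 1.0 - optimized_ms / baseline_ms


fp32_latency_ms = 21.70   # optimized FP32 solution (baseline)
bf16_latency_ms = 11.83   # optimized BF16 solution

print(f"{speedup(fp32_latency_ms, bf16_latency_ms):.2f}x")             # prints 1.83x
print(f"{latency_reduction(fp32_latency_ms, bf16_latency_ms):.1%}")    # prints 45.5%
```

The result sits close to, but below, the theoretical 2x for the BF16 dot product, which is consistent with MatMul taking over 80% of the runtime rather than all of it.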

Table 4. BERT model inference performance result

BERT model: Average Latency (ms)
Optimized FP32 solution: 21.70
Optimized BF16 solution: 11.83



For evaluating the accuracy of optimized FP32 and optimized BF16, compared to the unoptimized TensorFlow v1.14 (Eigen) solution, we used the MRPC datasets to test accuracy on a proxy model. The results are presented in Table 5. The optimized FP32 solution and the optimized BF16 solution show no accuracy loss versus the unoptimized TensorFlow v1.14 (Eigen) solution.

Table 5. BERT model accuracy result on MRPC datasets

BERT model: Accuracy (Predict Correct/Total)
Unoptimized TensorFlow v1.14 (Eigen) solution: 83.59% (1442/1725)
Optimized FP32 solution: 83.59% (1442/1725), no loss
Optimized BF16 solution: 83.59% (1442/1725), no loss


The new bfloat16 capability in Intel DL Boost on 3rd Gen Intel Xeon Scalable platform improved BERT model inference performance 1.83x compared with the optimized FP32 solution with no loss in accuracy. Alibaba Cloud expects these enhancements will help speed up online and offline BERT tasks to provide more efficient services. In the future, the Alibaba Cloud engineering team wants to transfer more FP32 ops to bfloat16 ops to boost performance further. Alibaba Cloud also hopes the optimization solution can scale to other businesses to help improve business performance to support customer needs.

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[2] Intel Intrinsics Guide

[3] google-research/bert

[4] A Study of BFLOAT16 for Deep Learning Training

Configuration Details

Alibaba Cloud PAI Customized BERT on TF1.14 Latency Performance on 3rd Gen Intel Xeon Scalable Processor: 

New: Tested by Intel as of 4/23/2020. 4 socket Intel® Xeon® Platinum 83xx(Ali Customized SKU) Processor using Intel Reference Platform, 24 cores HT On Turbo ON Total Memory 384 GB (24 slots/ 16GB/ 2933 MHz), BIOS: WCCCPX6.RPB.0018.2020.0410.1316 (ucode:0x7000017), Storage: Intel SSDPE2KX010T7, NIC: 2x Intel Ethernet Controller 10G x550T, OS: CentOS 8.1, 4.18.0-147.5.1.el8_1.x86_64, Deep Learning Framework: TF1.14, Compiler: gcc 8.3.1, oneDNN version: DNNLv1.3, Customized BERT(Confidential), BS=1, MRPC data, 12 instance/4 socket, Datatype: BF16  

Baseline: Tested by Intel as of 4/23/2020. 4 socket Intel® Xeon® Platinum 83xx(Ali Customized SKU) Processor using Intel Reference Platform, 24 cores HT On Turbo ON Total Memory 384 GB (24 slots / 16GB/ 2933 MHz), BIOS: WCCCPX6.RPB.0018.2020.0410.1316 (ucode:0x7000017), Storage: Intel SSDPE2KX010T7, NIC: 2x Intel Ethernet Controller 10G x550T, OS:CentOS 8.1, 4.18.0-147.5.1.el8_1.x86_64, Deep Learning Framework: TF1.14, Compiler: gcc 8.3.1, MKL version: 2020.1.217, Customized BERT(Confidential), BS=1, MRPC data, 12 instance/4 socket, Datatype: FP32

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit
Performance results are based on testing as of April 23, 2020 and may not reflect all publicly available updates. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.