Natural language processing (NLP) is a subset of artificial intelligence (AI) technologies that focuses on enabling computers to understand and process human language. Baidu, a leading China-based Internet and AI services company, supports over 100 applications through NLP technologies, with some modules being called more than 100 billion times per day. One example of Baidu’s use of NLP technologies is its online customer support chatbot. The chatbot is powered by a Deep Attention Matching (DAM) network model that was developed by Baidu engineers and is based on attention mechanisms. One important task of chatbots is response selection: choosing the best-matched response from a set of candidates by using the context of a conversation.
PaddlePaddle (Parallel Distributed Deep Learning) is a deep learning framework developed by Baidu and widely used in Baidu’s online and offline services and products. As Baidu seeks to integrate its chatbot with its PaddlePaddle framework, Baidu and Intel engineers worked together to optimize the performance of the DAM model on Intel architecture. Following software optimizations by Intel, the performance gains on an Intel® Xeon® Gold 6148 CPU based system with PaddlePaddle* are shown in Table 1.
Table 1: DAM model inference latency (per sample). Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.
Model Profiling and Analysis
Intel worked with Baidu to support optimized and intelligent services based on Intel® architecture. In this case, we started our DAM model optimization by analyzing the most time-consuming operators (or “hotspots”). As shown in Figure 1, these were layer_norm, softmax, stack and conv3d. These were our first priority for optimization, as they total more than 80% of all operations in the model.
We followed the overall structure of Baidu’s DAM network model to analyze these operators.
- Representation: Representation consists of a repeatable attentive module (see Figure 2) that captures semantic dependencies among words and sentences. The layer_norm op is used in this repeatable module to prevent vanishing or exploding gradients, and its calculation is complex.
- Matching: Each utterance and the response are matched with each other by using segment-segment similarity matrices, which are stacked to form the input of the 3D convolution. The stack op appears in this module and is a memory-level operation.
- Aggregation: Finally, DAM aggregates all the segmental matching degrees across each utterance and response into a high-dimensional 3D matching image Q. Two layers of conv3d with pool3d are used at the end of the network, as shown in Figure 3.
Figure 3: Aggregation. The 3D matching image is the input of the 3D convolution.
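As a rough illustration of how the matching image Q can be assembled, the sketch below computes one dot-product similarity per representation layer for every (utterance word, response word) pair and keeps the per-layer values as channels. It is a simplified sketch under our own assumptions: the function names and the nested-list layout are hypothetical, and the published model may include additional matching channels that are not shown here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matching_image(utterances, response):
    """Build Q[i][k][t] = list of per-layer similarities (the channels).

    utterances: list over utterances; each utterance is a list over
                layers of word vectors (n_words x dim).
    response:   list over layers of word vectors (n_words x dim).
    Layout and names are illustrative assumptions, not PaddlePaddle's API.
    """
    n_r = len(response[0])
    q = []
    for utt in utterances:
        n_u = len(utt[0])
        plane = [[[dot(u_layer[k], r_layer[t])
                   for u_layer, r_layer in zip(utt, response)]
                  for t in range(n_r)]
                 for k in range(n_u)]
        q.append(plane)
    return q

# One utterance with two words, two representation layers, dim = 2.
utt = [[[1.0, 0.0], [0.0, 1.0]],
       [[2.0, 0.0], [0.0, 2.0]]]
# A response with a single word, same two layers.
resp = [[[1.0, 1.0]],
        [[1.0, 0.0]]]
Q = matching_image([utt], resp)
```

The result is indexed as (utterance, utterance word, response word, channel), which is the kind of volume a 3D convolution can consume.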
Optimizing Operations through Workload Acceleration
The Intel® Math Kernel Library (Intel® MKL), Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and Intel® Advanced Vector Extensions (Intel® AVX) all contribute to machine-learning workload acceleration. Choosing optimizations wisely can produce the largest performance gains at the op-level, as shown in Table 3 and Figure 4.
After optimizing these operators, the latency (per sample) of the DAM model decreased by about 2.3x. Table 4 shows the model’s performance gain from the operator-level optimizations.
|Batch size||baseline(ms)||best-fit optimization (ms)||Gain|
Table 4: Model performance gain with “best-fit” op-level optimization. Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.
Library utilization optimization (PR#14437): Softmax
After profiling the softmax op implementation, we found that over 50% of softmax execution time was spent in the exponential (e^x) part and 30% in the normalization (1/∑) part. Therefore, we targeted the e^x computation first, followed by the summation and elementwise division. We used Intel MKL, which implements the BLAS and Sparse BLAS routines, to optimize these two parts.
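The three stages named above can be made explicit in a small reference version. This is a plain-Python sketch for illustration only, not the PaddlePaddle kernel or an Intel MKL call; the function name is hypothetical, and the max-subtraction step is a standard numerical-stability measure that we add as an assumption.

```python
import math

def softmax_rows(rows):
    """Row-wise softmax split into exp, sum, and scale stages."""
    out = []
    for row in rows:
        # Subtract the row maximum so exp() cannot overflow
        # (standard stability step, assumed here).
        shift = max(row)
        # Stage 1: elementwise exponential, the dominant cost in the profile.
        exps = [math.exp(x - shift) for x in row]
        # Stage 2: reduce to one sum and take its reciprocal once per row.
        inv_sum = 1.0 / sum(exps)
        # Stage 3: scale every element by the reciprocal.
        out.append([e * inv_sum for e in exps])
    return out

probs = softmax_rows([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
```

Computing the reciprocal once per row turns the elementwise division into a vector scaling, which is the kind of operation a BLAS library handles well.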
Complex calculation op optimization (PR#14417): Layer Normalization
Equation 1 shows the calculation of layer normalization. Neither Intel MKL nor Intel MKL-DNN provides a directly optimized math function for this calculation. Although modern compilers produce well-optimized assembly code, we found that using Intel AVX explicitly improved many deep learning primitives. We therefore used vector instructions directly, which improved the performance of layer normalization by 7X, as shown in Figure 4.
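For reference, layer normalization in its commonly used form computes a mean and variance over the H features of each input vector and then applies a learned scale and shift. The symbols H, γ, β and ε below follow the conventional definition and are assumptions on our part; the exact form used in the PaddlePaddle op may differ in details such as the value of ε.

```latex
\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad
\sigma^2 = \frac{1}{H}\sum_{i=1}^{H} \left(x_i - \mu\right)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
```

The two reductions and the final elementwise pass each sweep the whole feature vector, which is why the op benefits from hand-written vector code.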
Memory-bound operation optimization (PR#14488): Stack
The stack op stacks all of its inputs along one axis, which is a memory copy operation. To optimize memory-bound operators like this one, the main idea is to decrease the number of memory reads and writes in two ways: 1) make the most of the created memory and 2) utilize an optimized memory function. In this case, we used the “memcpy” function to refactor the implementation of stack for the performance gain shown in Figure 4.
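The idea of replacing per-element writes with block copies can be sketched as follows. This is an illustrative plain-Python analogue, with slice assignment standing in for memcpy; the function name and the (pre, post) shape convention are our own assumptions and not the PaddlePaddle implementation.

```python
def stack_blocks(inputs, pre, post):
    """Stack n flat buffers of logical shape (pre, post) into (pre, n, post).

    pre:  product of the dimensions before the stacking axis
    post: product of the dimensions after the stacking axis
    """
    n = len(inputs)
    # Allocate the output once and fill it in place.
    out = [0.0] * (pre * n * post)
    for i in range(pre):
        for j, buf in enumerate(inputs):
            dst = (i * n + j) * post
            src = i * post
            # One contiguous block copy per (i, j), standing in for memcpy.
            out[dst:dst + post] = buf[src:src + post]
    return out

a = [1, 2, 3, 4]  # logical shape (2, 2): [[1, 2], [3, 4]]
b = [5, 6, 7, 8]  # logical shape (2, 2): [[5, 6], [7, 8]]
stacked_axis1 = stack_blocks([a, b], pre=2, post=2)
stacked_axis0 = stack_blocks([a, b], pre=1, post=4)
```

The larger `post` is, the fewer and longer the copies become, so stacking along a leading axis degenerates into one block copy per input.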
Use Intel MKL-DNN to Further Optimize 3D Convolution
Enhance the 3D convolution operation with Intel MKL-DNN
Based on our profiling results, 3D convolution takes up about 9% of the model’s execution time. Intel MKL-DNN is an open source, performance-enhancing library for accelerating deep learning, especially convolution. Therefore, we used Intel MKL-DNN to enhance the conv3d performance. With the help of Intel MKL-DNN, we achieved an almost 4X performance gain for 3D convolution on our Intel Xeon Processor E5-2650 v4 based platform, as shown in Table 5.
|Batch size||PP-MKL-Total time-Baseline(ms)||PP-MKL-DNN-Total time-optimization (ms)||Gain|
Table 5: conv3d op total time comparison before and after optimization. DAM model, Batch size=1. Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.
|Batch size||DAM with MKL conv3d latency (per sample) (ms)||DAM with MKL-DNN conv3d latency (per sample) (ms)||Gain|
Table 6: Model performance gain with conv3d’s optimization. Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.
Fuse OP to save further
In PaddlePaddle, convolution with bias and elu is calculated with three operations: the conv3d op, the elementwise add op, and the elu op. Since Intel MKL-DNN supports convolution with bias and elu, we can fuse these three operations into a single conv3d op that computes the convolution together with the bias and elu. This helps decrease the framework overhead.
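The effect of this fusion can be illustrated with a toy example. The sketch below uses a 1-D convolution instead of conv3d to stay short, and all function names are hypothetical; it is not the Intel MKL-DNN API or the PaddlePaddle fuse pass, only a demonstration that applying the bias and activation inside the convolution loop gives the same result without materializing intermediate tensors.

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def unfused(signal, kernel, bias):
    # Three separate ops; each writes a full intermediate buffer.
    y = conv1d(signal, kernel)       # conv op
    y = [v + bias for v in y]        # elementwise add op
    return [elu(v) for v in y]       # elu op

def fused(signal, kernel, bias):
    # One op: bias and elu are applied while the accumulator is in hand,
    # so no intermediate buffers are created.
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]
        out.append(elu(acc + bias))
    return out

signal = [1.0, -2.0, 0.5, 3.0]
kernel = [0.5, -1.0]
separate = unfused(signal, kernel, bias=0.1)
combined = fused(signal, kernel, bias=0.1)
```

In a framework, the fused form also replaces three op launches with one, which is where the reduction in framework overhead comes from.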
After applying all of these optimizations, 95% of the operations in the model (by time proportion) run on optimized acceleration libraries, as seen in the following table:
|Op names||Time proportion in model||Optimization|
|fc||27%||Intel® MKL GEMM|
|softmax||18%||Intel® MKL BLAS|
|layer norm||15%||Math JIT|
|matmul||13%||Intel® MKL Batch GEMM|
|elementwise add||5%||Intel® MKL VADD|
Table 7: List of operators to which the “best-fit” optimizations were applied.
Key Takeaways and Further Thinking
Intel has developed a variety of framework optimizations, tools and software libraries to improve deep learning performance. Relying on one library or one method doesn’t always produce the best performance. For different operators, we chose the best ways to optimize rather than applying one type of optimization for all.
The goal of graph fusion is to minimize unnecessary calculations and memory accesses for the algorithm. If a fusion reduces time-consuming calculations, it is worth doing; if not, we can skip it. For the latest information on performance optimizations from the Intel AI team, follow us on @IntelAIResearch.
Notices and Disclaimers
Zhou, X., Lu, L., Dong, D., Liu, Y., Chen, Y., Zhao, X., Yu, D. and Wu, H. Multi-turn Response Selection for Chatbots with Deep Attention Matching Network, P18-1103, 2018.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation.
Performance results are based on testing as of November 8 and December 12, 2018 and may not reflect all publicly available security updates. No product or component can be absolutely secure.
To reproduce the testing environment, first select the path where you want to store PaddlePaddle, then use the following command to clone PaddlePaddle's source code from GitHub to a folder named Paddle in the current directory: git clone https://github.com/PaddlePaddle/Paddle.git. Go to the Paddle directory (you can choose to compile with Docker or locally). More details can be found in the PaddlePaddle documentation.
Use image provided by Baidu: docker run --name paddle-test -v $PWD:/paddle --network=host -it hub.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
We chose local compilation: cmake -DWITH_TESTING=ON -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_MKL=ON -DWITH_SWIG_PY=OFF -DWITH_INFERENCE_API_TEST=ON -DON_INFER=ON ..
Baseline benchmark: commit id: 1001f8e1dbd913a3560f067f39a19f1dde7bae19
Command line to run baseline DAM benchmark on Paddle:
./paddle/fluid/inference/tests/api/test_analyzer_dam --infer_model=third_party/inference_demo/dam/model/ --infer_data=third_party/inference_demo/dam/data.txt --gtest_filter=Analyzer_dam.profile --paddle_num_threads=1 --repeat=5 --batch_size=1 --use_analysis=false --test_all_data
Optimization benchmark: commit id: acc6ae49b18cb55db4dd84cd09069ebe01a1b54a
Command line to run the optimized DAM benchmark on Paddle:
./paddle/fluid/inference/tests/api/test_analyzer_dam --infer_model=third_party/inference_demo/dam/model --infer_data=third_party/inference_demo/dam/data.txt --gtest_filter=Analyzer_dam.profile_mkldnn --paddle_num_threads=1 --batch_size=1 --repeat=5 --test_all_data
Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation