Softmax is a function used for classification problems in machine learning. It has been broadly applied to image classification in deep learning (where its execution time is small compared with convolution models) and it is now being adopted more frequently in natural language processing (NLP) models. However, without performance optimization, the softmax function may result in higher computation costs for these models.
This blog is based on a paper recently authored by Intel AI researchers Jacek Czaja, Michal Gallus, and Tomasz Patejko and Baidu researcher Jian Tang that presents a methodology of optimizations applied to the Softmax function. The goal of the project was to learn whether Softmax could be optimized to deliver equivalent – and possibly higher – performance through better utilization of a processor’s computing resources. Testing revealed that, in fact, the methods developed to optimize Softmax did produce performance gains.
Starting with Single-thread and PaddlePaddle*
This discussion centers on improvements of the Softmax operation for x86-64 architectures, in particular Intel® Xeon® Scalable processors. Efforts were limited to single-thread execution since the optimization process generally starts with exploiting all the capabilities of a single core.
Testing focused on inference, with a deep attention matching (DAM) model and Baidu’s PaddlePaddle* as the deep learning platform. The Intel® Xeon® Platinum 8180 processor served as the single-core hardware platform.
PaddlePaddle, an open source deep learning framework, offers a function to check the execution time of operators, which is critical for obtaining performance results for Softmax execution. While optimizing Softmax, the team used PaddlePaddle profiling to track the performance of both Softmax and the overall DAM model. The team also profiled the individual operations within Softmax to target the most time-consuming ones, and observed that executing the exponential function takes significant time.
Throughout the optimization process, algorithmic modifications were performed to decrease execution time. A key consideration was how best to spare developers the effort of low-level optimizations for the most common mathematical algorithms.
Exponential computations and elementwise division were replaced with BLAS functions provided by the Intel® Math Kernel Library (Intel® MKL). While PaddlePaddle baseline code employs Eigen, a fast and elegant library, Intel MKL provides implementations optimized for x86-64 architecture, and Intel Xeon processors in particular, so it presented an effective alternative. The remaining Eigen code was replaced with a hand-crafted implementation.
Intel MKL functions, accompanied by hand-crafted code, produced a performance improvement of about 2X. Figure 1 provides details on the code used to achieve this speedup.
Figure 1: Intel® MKL-based Implementation
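Since the figure itself is not reproduced here, the following is a minimal C++ sketch of how a Softmax row can be split into stages that map onto library calls. It is not the code from Figure 1. The function name softmax_row is hypothetical, and the max-subtraction step is a common numerical-stability measure that the text above does not mention. The loops use only the standard library so the sketch compiles anywhere; the comments name Intel MKL routines (vsExp, cblas_sasum, cblas_sscal) that could plausibly take their place, which is an assumption rather than a statement of what the team used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over a single row of logits, written as separate stages.
// Illustrative sketch only; softmax_row is a hypothetical name.
std::vector<float> softmax_row(const std::vector<float>& x) {
  assert(!x.empty());
  const std::size_t n = x.size();
  std::vector<float> y(n);

  // Stage 1 (assumed, not described in the text): subtract the row
  // maximum so that exp() cannot overflow.
  const float max_val = *std::max_element(x.begin(), x.end());
  for (std::size_t i = 0; i < n; ++i) y[i] = x[i] - max_val;

  // Stage 2: elementwise exponential.
  // With Intel MKL this loop could become: vsExp(n, y.data(), y.data());
  for (std::size_t i = 0; i < n; ++i) y[i] = std::exp(y[i]);

  // Stage 3: sum of the exponentials. All terms are positive, so an
  // absolute-value sum such as cblas_sasum(n, y.data(), 1) is equivalent.
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += y[i];

  // Stage 4: elementwise division, expressed as scaling by 1/sum.
  // With Intel MKL this loop could become:
  //   cblas_sscal(n, 1.0f / sum, y.data(), 1);
  const float inv_sum = 1.0f / sum;
  for (std::size_t i = 0; i < n; ++i) y[i] *= inv_sum;

  return y;
}
```

Expressing the division as one multiplication by the reciprocal is what lets a BLAS scaling routine handle it, and it replaces n divisions with a single one.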
The team went further, improving code not already replaced by Intel MKL. Taking advantage of OpenMP, several vector-related operations were optimized. OpenMP simd by itself (a hint that a loop should be vectorized) did not provide much of a performance boost, although it may reduce code size because the compiler does not have to generate multiple implementations of the code when such hints are provided. However, OpenMP simd combined with a reduction clause decreased execution time significantly.
Line 28 of the code in Figure 2 is the modification the team introduced. This optimization brought an additional 5% reduction in execution time.
Figure 2: Intel® MKL and OpenMP simd-based Implementation
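To illustrate what an OpenMP simd directive with a reduction clause looks like, here is a minimal sketch. The loop shown (a plain sum of floats) and the name sum_simd are assumptions made for illustration; the text does not say which loop the team annotated. With GCC or Clang the pragma takes effect when compiling with -fopenmp or -fopenmp-simd; without those flags it is ignored and the function still returns the same result for the inputs below.

```cpp
#include <cassert>
#include <cstddef>

// Sums n floats. The pragma marks the loop as vectorizable and declares
// `sum` a reduction variable, so the compiler may keep several partial
// sums in vector lanes and combine them after the loop.
// sum_simd is a hypothetical name used only for this sketch.
float sum_simd(const float* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  float sum = 0.0f;
#pragma omp simd reduction(+ : sum)
  for (std::size_t i = 0; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}
```

One likely reason the reduction clause matters: floating-point addition is not associative, so without explicit permission a compiler generally will not reorder the additions into vector lanes. The clause grants that permission for this one variable without enabling relaxed floating-point semantics globally.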
The full paper includes further details on the results of this work on vectorization, including compiler investigations. More detailed information on OpenMP vectorization is also available.
Once it was demonstrated that execution time could be improved, the team sought to find out whether further Softmax optimization could extend the improvement. To test this, Softmax in the DAM model was replaced with a memory copying routine (memcpy). The hypothesis was that if the Softmax and memcpy times were close, then the algorithm was likely bound by memory throughput, and further performance gains would be unlikely. As it turned out, the baseline implementation, which was not fully vectorized, was far from memory-bound. Figure 3 shows that the optimized Softmax in the DAM model executes 2X faster than the original implementation. This optimization improves the performance of the entire DAM model by over 15%.
Figure 3: Softmax Implementations Performance Comparison 
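The memcpy comparison can be sketched as a standalone microbenchmark. This is an illustration under stated assumptions, not the team's procedure: they swapped Softmax for memcpy inside the DAM model, whereas the sketch below times both on the same buffers in isolation. All names (seconds_per_call, naive_softmax, softmax_to_memcpy_ratio) are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: average wall-clock seconds per call of fn.
template <typename Fn>
double seconds_per_call(Fn fn, int reps) {
  assert(reps > 0);
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) fn();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count() / reps;
}

// Straightforward softmax used as the kernel under test.
void naive_softmax(const float* in, float* out, std::size_t n) {
  float max_val = in[0];
  for (std::size_t i = 1; i < n; ++i) max_val = in[i] > max_val ? in[i] : max_val;
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = std::exp(in[i] - max_val);
    sum += out[i];
  }
  for (std::size_t i = 0; i < n; ++i) out[i] /= sum;
}

// Returns (softmax time) / (memcpy time) for the same amount of data.
// A ratio near 1 suggests the kernel is limited by memory throughput;
// a much larger ratio suggests there is compute time left to optimize.
double softmax_to_memcpy_ratio(std::size_t n, int reps) {
  std::vector<float> in(n, 0.5f), out(n);
  volatile float sink = 0.0f;  // keeps the compiler from dropping the work
  const double t_copy = seconds_per_call([&] {
    std::memcpy(out.data(), in.data(), n * sizeof(float));
    sink = out[n - 1];
  }, reps);
  const double t_soft = seconds_per_call([&] {
    naive_softmax(in.data(), out.data(), n);
    sink = out[n - 1];
  }, reps);
  (void)sink;
  return t_soft / t_copy;
}
```

A call such as softmax_to_memcpy_ratio(1 << 20, 100) gives a rough indication only. The measured ratio depends on buffer size relative to the caches, on compiler flags, and on the machine, so it should be read as a direction for further work rather than a precise bound.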
This finding underscores the conclusion that performance can be increased through better utilization of the processor’s computing resources. Specifically, the benefits of effective Intel MKL implementations and more effective vectorization were observed.
Given that Softmax is a popular deep learning primitive, these optimizations have been upstreamed into the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). They are available to the public as a model for implementation. The team believes that the optimizations presented here could be transferred to other deep learning frameworks like TensorFlow and PyTorch, and encourages further deep learning optimizations for CPUs.
For more information, review the full report of this work, Softmax Optimizations for Intel® Xeon® Processor-based Platforms. To access the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN), a performance library for deep learning, go to: https://intel.github.io/mkl-dnn/index.html. For more AI research from Intel, follow @IntelAIDev and @IntelAI on Twitter, and visit ai.intel.com.
Notices and Disclaimers
Optimizations of Softmax using direct implementation in assembly language are not part of PaddlePaddle and Intel MKL-DNN repositories. For measuring performance, we created an integration branch. The experiments were executed using commit ID of the integration branch: 28bba75d9108026f236c312813caf5ba72a6aabe and the following commands:
OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=1 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile
echo "===> Batch 8"
OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=8 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile
echo "===> Batch 32"
OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=32 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile
echo "===> Batch 128"
OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=128 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile
 Configurations: All measures and performance evaluation as presented in this article were taken using Intel® Xeon® Platinum 8180 processor. Performance results are based on testing by Intel as of 1st of February 2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804.
Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation