Enterprises are exploring novel ways of providing stellar customer service. Voicebots are delivering just that: high-quality customer service, available at any time, from anywhere. Gartner estimates that by 2020, 25% of customer service and support operations will integrate virtual customer assistant technology across the engagement channels of voice, chat, and email. And the interactive voice response (IVR) market is expected to reach a value of USD 5.54 billion by 2023.
One of the first stages of any Voicebot deployment (and the most compute-intensive) is the Automatic Speech Recognition (ASR) process that converts speech to text. The open-source Kaldi Speech Recognition toolkit powers the most widely used ASR services in enterprise deployments today, due to its versatility in handling diverse language models and telephony speech. That’s why we’ve focused on performance improvements for Kaldi ASR running on Intel Xeon Scalable processors, to help our customers implement Voicebots with real-time response capabilities in large-scale deployments. We call it the Voicebot Scaling Challenge.
Kaldi Speech Recognition Toolkit
The Kaldi toolkit is very popular in the research community and has become the default toolkit of choice for ASR. In a typical Kaldi ASR pipeline, the input audio signal or waveform is processed to extract a series of features such as MFCCs (mel-frequency cepstral coefficients), CMVN (cepstral mean and variance normalization) statistics, and i-vectors; the MFCC/CMVN features represent the content of the audio, while i-vectors represent the style of the utterance or the speaker.
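As a rough illustration of this front end, the sketch below computes MFCC features with per-utterance CMVN using the librosa Python library. This is illustrative only: Kaldi computes these features with its own C++ binaries, and the i-vector extraction step is omitted here.

```python
# Illustrative sketch only: Kaldi uses its own C++ feature pipeline;
# librosa stands in here to show what an MFCC front end produces.
import librosa

def extract_features(wav_path, sr=8000, n_mfcc=13):
    """Load telephony-rate audio and compute MFCCs with per-utterance CMVN."""
    audio, sr = librosa.load(wav_path, sr=sr)
    # 25 ms windows with a 10 ms hop, the framing Kaldi uses by default.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    # Cepstral mean and variance normalization (CMVN) over the utterance.
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / \
           (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T  # shape: (num_frames, n_mfcc)
```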
The acoustic model transcribes the extracted features into a sequence of context-dependent phonemes (units of sound that distinguish one word from another in a particular language). Kaldi supports both Gaussian Mixture Model (GMM)-based and Deep Neural Network (DNN)-based implementations for acoustic modeling. With advances in AI and deep learning, DNNs are widely replacing GMM-based implementations.
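To make the acoustic model’s role concrete, here is a toy stand-in (not a real Kaldi chain model): it maps a single feature vector to a posterior distribution over a handful of hypothetical phoneme units. The layer sizes and weights are invented for illustration; real TDNN chain models are far deeper and emit per-frame log-likelihoods.

```python
import numpy as np

# Toy stand-in for a DNN acoustic model: map a feature vector to a
# probability distribution over context-dependent phoneme units.
rng = np.random.default_rng(0)
W = rng.standard_normal((40, 6))   # 40-dim features -> 6 toy phoneme units
features = rng.standard_normal(40)

logits = features @ W
posteriors = np.exp(logits - logits.max())
posteriors /= posteriors.sum()     # softmax over phoneme units
print(posteriors.argmax(), posteriors.max())
```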
The language model decoder takes the phonemes and turns them into lattices (representations of the alternative word sequences that are likely for a particular piece of audio). The decoding graph takes into account the grammar of the data, as well as the distribution and probabilities of specific contiguous word sequences (n-grams). In this benchmark, we used Kaldi’s standard WFST decoder implementation and compared it with Intel’s optimized decoder. We also used the nnet3-based Time Delay Neural Network (TDNN) models ASpIRE and Librispeech. The benchmark highlights acceleration options that can significantly boost inference performance.
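For intuition on the n-gram component, the toy bigram model below assigns log-probabilities to word sequences; the vocabulary, probabilities, and backoff value are all invented for illustration. A production decoding graph encodes this same n-gram information as WFST arc weights.

```python
import math

# Toy bigram language model: log-probability of a word given its
# predecessor. A real G transducer encodes this kind of n-gram
# information as WFST arc weights.
BIGRAM_LOGPROB = {
    ("<s>", "play"): math.log(0.4),
    ("play", "some"): math.log(0.5),
    ("some", "music"): math.log(0.6),
    ("music", "</s>"): math.log(0.7),
}

def sentence_logprob(words, backoff=math.log(1e-4)):
    """Score a sentence under the bigram model with a crude backoff."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, backoff)
               for pair in zip(tokens, tokens[1:]))

print(sentence_logprob(["play", "some", "music"]))
```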
Intel CPU Performance Optimizations for Kaldi
The entire Kaldi inference pipeline has been optimized for improved performance on Intel processors. Acoustic model optimizations are summarized here; the details have been covered in earlier publications. The performance of these operations is improved using tools like the Intel Math Kernel Library (Intel MKL), which contains BLAS routines specifically optimized for Intel processors, and the Deep Neural Network Library (DNNL) for neural network primitives.
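The inner loops of DNN acoustic-model inference reduce largely to dense matrix multiplications (GEMM). As a rough sketch of what MKL accelerates, the NumPy snippet below performs the affine-plus-ReLU computation of a single hidden layer; when NumPy is built against MKL, the matrix product dispatches to MKL’s optimized sgemm. The shapes here are illustrative, not taken from the actual models.

```python
import numpy as np

# A DNN acoustic model's affine layers reduce to GEMM calls. With NumPy
# linked against Intel MKL, the @ operator dispatches to MKL's sgemm.
frames = np.random.rand(512, 440).astype(np.float32)    # batched input features
weights = np.random.rand(440, 1024).astype(np.float32)  # one hidden layer
bias = np.zeros(1024, dtype=np.float32)

hidden = np.maximum(frames @ weights + bias, 0.0)  # affine + ReLU
print(hidden.shape)  # (512, 1024)
```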
Kaldi Decoder Overview
The decoder takes the scores from acoustic modeling and maps them to lattices or text, based on the language model. The Kaldi toolkit uses Weighted Finite-State Transducer (WFST)-based decoding, in which a beam search is conducted over a WFST that integrates four knowledge sources:
- Hidden Markov Model topology (H)
- Context-dependency (C)
- Pronunciation model (L)
- Language model (G)
During the search phase the acoustic scores are combined with the weights of the HCLG transducer to determine the best-scored word sequences. This process, known as “decoding,” is controlled by a number of parameters, e.g. beam width, acoustic scale factor, and others.
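The toy token-passing sketch below shows how these pieces fit together: each token carries an accumulated cost and word sequence, each WFST arc contributes a graph cost, each frame contributes an acoustic cost, and tokens far from the best score are pruned away. The graph, costs, and beam value are invented for illustration; a real HCLG graph has millions of states.

```python
# Minimal token-passing sketch over a toy WFST. Each arc carries
# (next_state, output_label, graph_cost); acoustic costs arrive as one
# label->cost mapping per frame.
ARCS = {
    0: [(1, "a", 0.5), (2, "b", 0.7)],
    1: [(3, "c", 0.3)],
    2: [(3, "d", 0.2)],
    3: [],
}

def decode(acoustic_costs, beam=10.0):
    tokens = {0: (0.0, [])}  # state -> (total_cost, word_sequence)
    for frame_costs in acoustic_costs:
        next_tokens = {}
        for state, (cost, words) in tokens.items():
            for nxt, label, graph_cost in ARCS[state]:
                # Combine graph (HCLG) cost with the frame's acoustic cost.
                total = cost + graph_cost + frame_costs.get(label, 5.0)
                if nxt not in next_tokens or total < next_tokens[nxt][0]:
                    next_tokens[nxt] = (total, words + [label])
        # Beam pruning: drop tokens far from the current best score.
        best = min(c for c, _ in next_tokens.values())
        tokens = {s: t for s, t in next_tokens.items() if t[0] <= best + beam}
    return min(tokens.values())

print(decode([{"a": 0.1, "b": 0.9}, {"c": 0.2, "d": 0.4}]))  # -> (1.1, ['a', 'c'])
```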
We used the ASpIRE chain model to evaluate the compute signature of the Kaldi decoder. For the specific configuration listed below, Kaldi decoder execution takes about 38% of the overall execution time. This share can be even higher depending on the decoder parameters and on the vocabulary and lexicon of the language model.
Intel has developed a new decoder library that boosts the language modeling performance of the Kaldi ASR decoder. This library will be available in binary form in a future release of the Intel® Distribution of OpenVINO™ Toolkit. To speed up decoding, a number of improvements have been applied. For instance, the WFST representation is not based on the original Kaldi HCLG WFST, but on a data structure that has been optimized for fast decoding. Furthermore, the search algorithm leverages a combination of beam pruning methods: beam pruning shrinks the search space by discarding tokens (each representing a path through the WFST) whose score in the previous step was significantly worse than the best score, and computational complexity is reduced further by discarding batches of tokens at once instead of individually. Finally, the library is not merely an optimized version of the Kaldi decoder; it has been implemented completely from scratch.
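A minimal sketch of the batched-pruning idea follows. It illustrates the concept with a single vectorized comparison over all active token scores; it is not Intel’s actual (from-scratch, binary-only) implementation.

```python
import numpy as np

# Sketch of batched pruning: instead of testing tokens one by one, keep all
# active token scores in a flat array and discard everything outside the
# beam with a single vectorized comparison.
def prune_batch(token_scores, beam):
    """token_scores: 1-D array of path costs; lower is better."""
    threshold = token_scores.min() + beam
    keep = token_scores <= threshold   # one comparison over the whole batch
    return np.flatnonzero(keep)        # indices of surviving tokens

scores = np.array([2.1, 9.7, 3.0, 15.2, 2.4])
print(prune_batch(scores, beam=4.0))   # -> [0 2 4]
```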
As shown in Figure 2, the Intel-optimized decoder takes a much smaller percentage of overall execution time. The performance improvements will vary depending on the complexity and size of the language model. The performance and accuracy of an ASR system can be measured by two key metrics: Real-Time Factor (RTF), the ratio of processing time to audio duration, and Word Error Rate (WER). RealTimeX is the reciprocal of RTF, i.e., how many times faster than real time the system runs. Improvements in RealTimeX should never come at the cost of WER.
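Both metrics are straightforward to compute. The sketch below derives RTF and RealTimeX from processing and audio durations, and computes WER as the normalized word-level edit distance between a reference transcript and a hypothesis.

```python
def real_time_metrics(processing_seconds, audio_seconds):
    rtf = processing_seconds / audio_seconds   # Real-Time Factor
    return rtf, 1.0 / rtf                      # RealTimeX = 1 / RTF

def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / ref length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1): d[i][0] = i
    for j in range(len(h) + 1): d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i-1][j] + 1,                     # deletion
                          d[i][j-1] + 1,                     # insertion
                          d[i-1][j-1] + (r[i-1] != h[j-1]))  # substitution
    return d[len(r)][len(h)] / len(r)

print(real_time_metrics(2.0, 16.0))              # RTF 0.125 -> 8x real time
print(wer("play some music", "play sum music"))  # 1 sub / 3 words ~ 0.33
```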
In real-time human-machine interactions powered by an online AI inference service, the key service metric has traditionally been latency, as batch inference performance claims are not relevant. However, in large-scale deployments, the marginal benefit of further latency reduction rapidly diminishes as well. In these use cases, the most useful metric for an online AI inference service like a Voicebot is latency-bound throughput, i.e., throughput at small batch sizes.
In real-time Voicebot scenarios, not all speech inputs are available at the same time. Therefore, we chose a small-batch throughput test as representative of real-time speech transcription, where input speech data arrives in very small batches and must be processed under tight latency requirements. The test is further categorized into ‘best case’ and ‘worst case’ scenarios.
In the best case scenario test, both the acoustic model and the language model are assumed to be fixed and do not change with every incoming audio stream. Only the time spent in feature extraction, acoustic model scoring, and language model decoding is counted.
In the worst case scenario test, the acoustic and language models are not assumed to be fixed, so the measured time also includes model loading: the process of fetching the models from storage into CPU or GPU main memory.
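The distinction between the two scenarios can be expressed as a simple timing harness. The sketch below is hypothetical: load_models, transcribe, and the sleep durations are stand-in stubs, not real Kaldi APIs. The best-case number counts only inference time; the worst-case number adds model-load time.

```python
import time

def load_models():
    time.sleep(0.5)          # stand-in for fetching large models from storage
    return object()

def transcribe(models, batch):
    time.sleep(0.01)         # stand-in for feature extraction + AM + LM decode

def benchmark(audio_batches):
    t0 = time.perf_counter()
    models = load_models()
    load_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    for batch in audio_batches:
        transcribe(models, batch)
    infer_time = time.perf_counter() - t1

    # Best case excludes model loading; worst case includes it.
    return infer_time, load_time + infer_time

best, worst = benchmark(range(8))
print(f"best case: {best:.2f}s, worst case: {worst:.2f}s")
```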
We benchmarked throughput at small input/batch sizes on an NVIDIA Tesla V100 GPU-based system (measured via an AWS P3 instance) and an Intel Xeon Gold 6252 processor-based system. Detailed system configuration tables are provided in the appendix. On the Intel-based system, we provide results for both the default Kaldi decoder and the Intel-optimized decoder.
The following figures plot the performance of throughput tests at small batch sizes for the ASpIRE and Librispeech models with the Librispeech test-clean dataset.
As shown above, on the ASpIRE model the Intel Xeon Gold CPU beats the NVIDIA V100 GPU by 6.8X at a batch size of 1; with the Intel-optimized decoder, the CPU advantage increases to 8.6X. On the Librispeech model, the Intel Xeon CPU has an 11X throughput advantage over the NVIDIA GPU in single-batch inference. On a multi-core CPU, input streams can be processed as soon as they are received, without waiting to batch streams together.
The ASpIRE model is more complex and a better representation of the production models deployed by our customers. Throughput improvements on a single CPU node can help large production systems deploy AI inference services like a Voicebot without requiring the purchase of additional accelerator hardware.
Kaldi ASR engines power a large majority of enterprise Voicebots in production today, and Intel Xeon Scalable processors offer unique performance benefits for this class of workload. In this work, we focused on measuring latency-bound throughput of Kaldi ASR on a single compute node and demonstrated 8.6X faster throughput for the ASpIRE model and 11X faster throughput for the Librispeech model on Intel Xeon Gold CPUs vs. NVIDIA V100 GPUs in single-batch inference. For enterprises that deploy millions of concurrent Voicebots, these throughput improvements deliver incredible performance and maximize the value of existing large-scale production systems.
Special thanks to Georg Stemmer and Joachim Hofer for their contributions to this blog post.
References
- Kaldi Speech Recognition Toolkit (Povey et al., 2012)
- Speaker Adaptive Model for Hindi Speech using Kaldi Speech Recognition Toolkit
- Automatic Speech Recognition using the Kaldi Toolkit (Briere, 2018)
- How to Start with Kaldi and Speech Recognition (Ramon, 2018)
- Kaldi ASR: Extending the ASpIRE model (Varga, 2017)
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance results are based on internal testing as of September 2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice Revision #20110804
 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Refer to http://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel software products.
Kaldi decoder parameters like beam, lattice-beam and max-active were tuned to achieve similar WER % for both CPU and GPU. Additionally, CUDA decoder parameters like max_q_capacity and aux_q_capacity were adjusted accordingly to run the ASpIRE model.
The CUDA flag `iterations` is set to 1, as it does not make sense to process the same input audio for multiple iterations: doing so improves RealTimeX but trivializes the data movement cost between host system memory and GPU memory.
Also, the current implementation of the CUDA decoder does not support multiple instances of batched-wav-nnet3-cuda on a single AWS NVIDIA V100 GPU instance. Software used for GPU testing was chosen based on information provided by NVIDIA as of Oct 17, 2019.
| Software Configurations | Kaldi ASR - Intel CPU | Kaldi ASR - NVIDIA GPU |
|---|---|---|
| Other libs used in benchmarks | MKL 2019u2 | CUDA 10.1 |
| Dataset | Librispeech (test-clean, test-other) | Librispeech (test-clean, test-other) |
| ASpIRE model (config flags) | Acoustic Model: ~141 MB, Language Model: ~1020 MB | Acoustic Model: ~141 MB, Language Model: ~1020 MB |
| Librispeech model (config flags) | Acoustic Model: ~78 MB, Language Model: ~192 MB | Acoustic Model: ~78 MB, Language Model: ~192 MB |
| GPU driver | - | Driver Version: 418.87, CUDA Version: 10.0 |
| Intel Decoder Library | Available in a future OpenVINO release | - |
| Hardware Configurations | Intel CPU | NVIDIA GPU |
|---|---|---|
| CPU | Intel Xeon Gold 6252 CPU @ 2.10GHz | Intel Xeon E5-2686 v4 @ 2.30GHz |
| BIOS version | SE5C620.86B.0D.01.0286.011120190816 | 4.2, Amazon EC2 |
| System DDR memory config | 12 slots / 16 GB / 2933 MHz | 4 slots / 16384 MB / unknown speed |
| Total memory/node (DDR+DCPMM) | 192 GB | 128 GB |
| NIC | Intel Ethernet X527DA2OCP | Amazon.com, Inc. Elastic Network Adapter (ENA) |
| Other HW (accelerator) | - | Tesla V100-SXM2-16GB |
| OS | CentOS-7 | Ubuntu 16.04.6 LTS |
Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
© Intel Corporation