Tencent: Qiao Tian, Senior Researcher; Linghui Chen, Principal Researcher; Heng Lu, Principal Researcher; Dong Yu, Distinguished Scientist

Intel Corporation: Pujiang He, Software Architect; Ethan Xie, Senior Cloud Software Engineer; Jianyan Lv, Senior Cloud Software Engineer; Ciyong Chen, Senior Deep Learning Software Engineer

Neural vocoders, including WaveNet, Parallel WaveNet, WaveRNN, LPCNet, and Multiband WaveRNN [1], have been proposed to work with sequence-to-sequence acoustic models and improve the quality of TTS. The WaveNet vocoder can generate high-fidelity audio, but its huge computational complexity limits its deployment in real-time services. The LPCNet vocoder applies the linear-prediction characteristic of speech signal processing within the WaveRNN architecture, generating high-quality speech faster than real time on a single processor core. However, it is still not efficient enough for online speech generation tasks.

To address this, Tencent AI Lab and Cloud Xiaowei developed a new neural vocoder, FeatherWave [2], which extends WaveRNN with multi-band linear prediction and significantly improves speech synthesis efficiency. Working in collaboration with an Intel engineering team, Tencent engineers incorporated optimizations for 3rd Gen Intel Xeon Scalable processors, formerly codenamed “Cooper Lake”, using the new brain floating point (bfloat16) capability in Intel Deep Learning Boost (Intel DL Boost). With these optimizations, speech is generated 1.54x faster than with FP32, at the same quality level (MOS 4.5).

Tencent also presented an improved model derived from GAN and Parallel WaveNet (PWaveNet) [3]. When running PWaveNet on the company’s existing 2nd Gen Intel Xeon Scalable processor deployment, performance did not meet the company’s real-time requirements. Tencent and Intel then optimized it for 3rd Gen Intel Xeon Scalable processors and achieved a 1.89x speed-up over FP32 while keeping the same quality level (MOS 4.4).

In this blog, we introduce the optimization techniques and performance results for the customized WaveRNN and PWaveNet models running on the 3rd Gen Intel Xeon Scalable processor family. Intel Xeon Scalable processors are built to run complex artificial intelligence (AI) workloads, taking embedded AI performance to the next level with Intel DL Boost. Intel Advanced Vector Extensions 512 (Intel AVX-512) and Vector Neural Network Instructions (VNNI) are already supported. With the introduction of 3rd Gen Intel Xeon Scalable processors, Intel DL Boost now includes bfloat16. Bfloat16 is a truncated form of FP32 that keeps FP32’s 8-bit exponent, giving it a much larger dynamic range than half precision (FP16) for deep learning workloads. It is also more convenient than INT8 because it does not require quantization/dequantization with calibration data.

Intel DL Boost now includes the following bfloat16 instructions:

- VCVTNE2PS2BF16: convert two packed single-precision vectors into one packed bfloat16 vector
- VCVTNEPS2BF16: convert one packed single-precision vector into one packed bfloat16 vector
- VDPBF16PS: calculate the dot product of bfloat16 pairs and accumulate the result into packed single-precision values
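In portable C++, the semantics of these instructions can be sketched by emulating bfloat16 as the upper 16 bits of an IEEE-754 single-precision value. This is an illustrative scalar emulation, not the actual vectorized implementation (NaN handling is omitted for brevity):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bfloat16 is simply the high 16 bits of an FP32 value, so it keeps FP32's
// 8-bit exponent (and therefore its dynamic range) while dropping 16 bits
// of mantissa.
using bf16 = std::uint16_t;

// Convert FP32 -> bfloat16 with round-to-nearest-even, one lane of what
// VCVTNEPS2BF16 / VCVTNE2PS2BF16 do across a vector.
bf16 fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    std::uint32_t lsb = (bits >> 16) & 1u;  // last bit that will be kept
    bits += 0x7FFFu + lsb;                  // round to nearest even
    return static_cast<bf16>(bits >> 16);
}

// Convert bfloat16 -> FP32 (exact: the low mantissa bits are just zero).
float bf16_to_fp32(bf16 h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Dot product of a bfloat16 pair accumulated into FP32, like one lane of
// VDPBF16PS: acc += a0*b0 + a1*b1, with the products formed in FP32.
float bf16_dot_accumulate(float acc, bf16 a0, bf16 a1, bf16 b0, bf16 b1) {
    return acc + bf16_to_fp32(a0) * bf16_to_fp32(b0)
               + bf16_to_fp32(a1) * bf16_to_fp32(b1);
}
```

Note how the accumulation stays in FP32: the precision loss is confined to the multiplicands, which is why bfloat16 GEMM kernels can match FP32 quality so closely.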

Customized WaveRNN

Figure 1 illustrates the structure of Tencent’s proposed FeatherWave vocoder [2]. Mel spectrograms, which are widely used in neural TTS systems, serve as the input to the condition network, while the sample-rate network produces several samples with a multi-dual softmax output layer. Like the original WaveRNN, the sampling network first predicts the coarse part of the excitation signal, then computes the fine part conditioned on the predicted coarse signal. The subband signal is predicted from the network’s output excitation and the linear-predicted signal. A merge-band operation then reconstructs the original waveform from the predicted subband signals.

The optimizations discussed here apply several techniques, ranging from general ones to those common in High Performance Computing (HPC) workloads. All optimizations were performed on the customized WaveRNN model, which is similar to the FeatherWave vocoder but without multi-band linear prediction. For FeatherWave, we would expect similar performance improvements.

Optimization Approach

1.       Take advantage of Intel AVX-512 and bfloat16 instructions
Ensure all the SGEMV computations in the coarse and fine parts of the GRU module, as well as the Dense operator, are vectorized with 512-bit registers and use the bfloat16 dot-product instructions. Element-wise addition/multiplication and other non-linear activations likewise run on the latest Intel AVX-512 instructions.

2.       Better memory allocator with alignment
Memory alignment is critical for achieving the best performance with AVX-512 instructions. Changing all memory allocations to use `_mm_malloc(size, 64)` ensures every buffer used in the workload is aligned to 64 bytes.
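As a minimal sketch of the idea, using C++17 `std::aligned_alloc` as a portable stand-in for `_mm_malloc(size, 64)`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocate a float buffer aligned to a 64-byte boundary so that 512-bit
// (64-byte) AVX-512 loads/stores never straddle a cache line.
// std::aligned_alloc requires the size to be a multiple of the alignment,
// so round it up first; _mm_malloc(bytes, 64) does not need this step.
float* alloc_aligned_floats(std::size_t count) {
    std::size_t bytes = count * sizeof(float);
    std::size_t rounded = (bytes + 63) / 64 * 64;  // round up to 64B multiple
    return static_cast<float*>(std::aligned_alloc(64, rounded));
}

bool is_64b_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Buffers obtained this way can be loaded with aligned AVX-512 instructions instead of their slower unaligned counterparts.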

3.       Cache blocking techniques
Reorganize the weight matrix of the SGEMV operation into a blocked format to improve spatial locality compared with the plain (column-major) format, and tile the input vector of the SGEMV operation to improve temporal locality.
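The repacking idea can be sketched as follows; the 16-row panel size (one 512-bit register of floats) and the layout are illustrative assumptions, not the tuned production format:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Repack a column-major rows x cols matrix into contiguous 16-row panels
// (16 floats = one 512-bit register), so an SGEMV kernel walks memory
// sequentially instead of striding by `rows`.
constexpr int kBlock = 16;

// Assumes rows is a multiple of kBlock for brevity.
std::vector<float> pack_blocked(const std::vector<float>& a_colmajor,
                                int rows, int cols) {
    std::vector<float> packed(static_cast<std::size_t>(rows) * cols);
    std::size_t idx = 0;
    for (int rb = 0; rb < rows; rb += kBlock)       // one 16-row panel
        for (int c = 0; c < cols; ++c)              // all columns of panel
            for (int r = rb; r < rb + kBlock; ++r)  // contiguous within panel
                packed[idx++] =
                    a_colmajor[static_cast<std::size_t>(c) * rows + r];
    return packed;
}

// y = A * x on the packed layout; the inner 16-wide loop maps directly onto
// a single AVX-512 FMA in the real implementation.
std::vector<float> sgemv_blocked(const std::vector<float>& packed,
                                 const std::vector<float>& x,
                                 int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    std::size_t idx = 0;
    for (int rb = 0; rb < rows; rb += kBlock)
        for (int c = 0; c < cols; ++c)
            for (int r = rb; r < rb + kBlock; ++r)
                y[r] += packed[idx++] * x[c];
    return y;
}
```

The packing cost is paid once per model load, while the SGEMV runs once per generated sample, so the repack amortizes almost immediately.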

4.       Op fusion to reduce the amount of memory access
There are many element-wise operations inside the coarse and fine parts that update the status of all three GRU gates. Fusing these element-wise operations keeps most of the data in registers/cache instead of loading and storing it from/to DRAM, which results in a performance boost.
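A sketch of the fusion, assuming the pre-activation gate inputs have already been produced by the SGEMVs (the gate algebra is simplified for illustration, not the full GRU equations):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Unfused: three separate passes, each streaming the vectors through
// memory again and materializing intermediates.
std::vector<float> gru_elementwise_unfused(const std::vector<float>& zu,
                                           const std::vector<float>& ru,
                                           const std::vector<float>& cu,
                                           const std::vector<float>& h) {
    std::size_t n = h.size();
    std::vector<float> z(n), c(n), h_new(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = sigmoidf(zu[i]);       // pass 1
    for (std::size_t i = 0; i < n; ++i)                               // pass 2
        c[i] = std::tanh(cu[i] * sigmoidf(ru[i]));
    for (std::size_t i = 0; i < n; ++i)                               // pass 3
        h_new[i] = z[i] * h[i] + (1.0f - z[i]) * c[i];
    return h_new;
}

// Fused: one pass, so the gate values live in registers and the state
// vector is read and written exactly once.
std::vector<float> gru_elementwise_fused(const std::vector<float>& zu,
                                         const std::vector<float>& ru,
                                         const std::vector<float>& cu,
                                         const std::vector<float>& h) {
    std::size_t n = h.size();
    std::vector<float> h_new(n);
    for (std::size_t i = 0; i < n; ++i) {
        float z = sigmoidf(zu[i]);
        float c = std::tanh(cu[i] * sigmoidf(ru[i]));
        h_new[i] = z * h[i] + (1.0f - z) * c;
    }
    return h_new;
}
```

Both versions compute the same result; the fused one simply touches DRAM a third as often, which matters for a memory-bound autoregressive loop.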

5.       Balance Sparse SGEMV when scaling up with more cores
Sparse models, which the customized WaveRNN adopts, deliver much better quality than small dense models. When scaling up to more cores, it is critical to balance the amount of computation and dispatch the workload as evenly as possible across the cores (this optimization is only effective in a multi-core environment).
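One common way to balance such work, shown here as a sketch rather than Tencent’s exact scheme, is to assign each matrix row greedily to the currently least-loaded core, weighting rows by their non-zero count (in a sparse SGEMV the cost of a row is proportional to its non-zeros, not just the row count):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns, for each row, the core it is assigned to. Greedy least-loaded
// assignment; sorting rows by descending nnz first would balance even better.
std::vector<int> balance_rows(const std::vector<std::size_t>& nnz_per_row,
                              int num_cores) {
    // min-heap of (current load, core id)
    std::priority_queue<std::pair<std::size_t, int>,
                        std::vector<std::pair<std::size_t, int>>,
                        std::greater<>> load;
    for (int c = 0; c < num_cores; ++c) load.push({0, c});

    std::vector<int> core_of_row(nnz_per_row.size());
    for (std::size_t r = 0; r < nnz_per_row.size(); ++r) {
        auto [work, core] = load.top();
        load.pop();
        core_of_row[r] = core;
        load.push({work + nnz_per_row[r], core});  // core takes on this row
    }
    return core_of_row;
}
```

With a naive equal-row split, one core can end up with most of the non-zeros and the others stall at the synchronization point every sample.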

Performance Results
The following performance results were obtained with the customized WaveRNN on 3rd Gen Intel Xeon Scalable processors.

Table 1. Hardware Configuration

| | FP32 & BF16 Configuration |
| --- | --- |
| # Nodes | |
| # Sockets | |
| CPU | 3rd Gen Intel Xeon Scalable processors |
| Cores/socket, Threads/socket | |
| BIOS version | |
| System DDR Mem Config: slots / cap / run-speed | 24 slots / 16 GB / 2933 |
| Total Memory/Node (DDR+DCPMM) | |
| Storage - boot | |
| Storage - application drives | |
| NIC | 2x Ethernet Controller 10G X550T |
| OS | CentOS 8.1 |

Table 2. Software Configuration

| | FP32 and BF16 configurations |
| --- | --- |
| Framework | MXNet v1.7 |
| Workload | Customized WaveRNN |
| Compiler | gcc 8.3.1 |
| Libraries (incl. version), e.g. MKL-DNN or DAAL | oneDNN 1.3 |
| Dataset (size, shape) | Customer-provided dataset |
| Precision (FP32, INT8, BF16) | FP32 vs. BF16 |
| | Not applicable |
| Run method | One core per instance; numactl used to bind each instance to a different core |



Table 3. Performance Results

| Test Case | Batch size | FP32 Solution: Total Throughput (RT) | BF16 Solution: Total Throughput (RT) | Performance Gain |
| --- | --- | --- | --- | --- |
| Customized WaveRNN, Tencent proprietary dataset | | | | 1.54x |

RT (higher is better) = speech time / time used to synthesize the speech


WaveNet is one of a family of autoregressive deep generative models that have been applied successfully to data as diverse as text, images, music, video, handwriting and human speech. Modeling raw audio signals, as WaveNet does, represents a particularly extreme form of autoregression, with up to 24,000 samples predicted per second. Operating at such a high temporal resolution is not problematic during network training, where the complete sequence of input samples is already available and can be processed in parallel. When generating samples, each input sample must be drawn from the output distribution before it can be passed in as input at the next time step. This makes parallel processing impossible.

Figure 2. Block diagram of WaveNet [4]

Inverse autoregressive flows (IAFs) are an important part of PWaveNet. IAFs represent a dual formulation of deep autoregressive modeling: sampling can be performed in parallel, while the training procedure required for likelihood estimation is sequential and slow. Tencent proposed integrating Parallel WaveNet with a GAN [3] for better speech quality.

Optimization Approach

1.       Transform the dilated conv1D op into a combination of several General Matrix Multiplies (GEMMs)
In the original model graph, the conv1D is composed of conv2D, padding, SpaceToBatchND, BatchToSpaceND, StridedSlice, Concat, etc. Some of these ops are complicated and execute inefficiently. We simplified the topology by transforming the conv1D into a combination of several GEMMs. This not only reduces the computation, but also reduces memory accesses, which improves the cache hit rate. (We did not use the convolution primitive in the Deep Neural Network Library (DNNL), now the oneAPI Deep Neural Network Library (oneDNN), directly because the input sizes of the conv1D change constantly.)
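The transformation can be sketched with a single-channel example: an im2col-style gather turns the dilated input windows into a dense matrix, after which one GEMM (here degenerated to a GEMV, since there is one filter) produces the whole output. The layout and naming are illustrative, not the model’s actual multi-channel kernels:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the (out_len x kernel) "im2col" matrix for a dilated conv1D with
// "valid" padding: row t holds the dilated window starting at t.
std::vector<float> im2col_dilated(const std::vector<float>& x,
                                  int kernel, int dilation) {
    int out_len = static_cast<int>(x.size()) - (kernel - 1) * dilation;
    std::vector<float> cols(static_cast<std::size_t>(out_len) * kernel);
    for (int t = 0; t < out_len; ++t)
        for (int k = 0; k < kernel; ++k)
            cols[static_cast<std::size_t>(t) * kernel + k] = x[t + k * dilation];
    return cols;
}

// y = cols * w : a single matrix multiply replaces the whole
// pad / SpaceToBatchND / conv2D / BatchToSpaceND chain.
std::vector<float> conv1d_as_gemm(const std::vector<float>& x,
                                  const std::vector<float>& w, int dilation) {
    int kernel = static_cast<int>(w.size());
    int out_len = static_cast<int>(x.size()) - (kernel - 1) * dilation;
    std::vector<float> cols = im2col_dilated(x, kernel, dilation);
    std::vector<float> y(out_len, 0.0f);
    for (int t = 0; t < out_len; ++t)
        for (int k = 0; k < kernel; ++k)
            y[t] += cols[static_cast<std::size_t>(t) * kernel + k] * w[k];
    return y;
}
```

With multiple input/output channels the same gather feeds a genuine GEMM, which a tuned BLAS kernel executes far more efficiently than the original op chain.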

2.       Take advantage of Intel DL Boost: AVX-512 instructions
In the model, some ops (such as tanh and sigmoid) account for a relatively high proportion of the run time. We reimplemented these ops using Intel AVX-512 instructions; this improves the execution efficiency of the instructions and boosts performance significantly.

3.       Take advantage of Intel DL Boost: bfloat16 instructions
bfloat16 instructions are a new feature of 3rd Gen Intel Xeon Scalable processors. They reduce the time spent moving numbers in and out of memory, and low-precision circuits are far less complex. In this model, we convert all convolution input data from FP32 into bfloat16 and then call the bfloat16 GEMM to do the calculation. This optimization brings significant performance improvements.

4.       Use OpenMP to improve the parallelism of the graph flow
In the model, the composition of some operations is complicated, which limits the parallelism of the whole graph. After simplifying with the GEMM function, we introduced the OpenMP parallelism mechanism into the code for the remaining trivial operations. This also improves performance significantly.
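A minimal sketch of the mechanism (the function and loop here are illustrative, not taken from the model). Compiled with `-fopenmp` the iterations are split across threads; without it the pragma is ignored and the code runs serially with identical results:

```cpp
#include <cassert>
#include <cstddef>

// A "trivial" element-wise op whose iterations are independent, so they can
// be distributed across cores with a single OpenMP directive.
void scale_and_shift(const float* in, float* out, std::size_t n,
                     float scale, float shift) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i)
        out[i] = in[i] * scale + shift;
}
```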

Table 4. Performance Results

| Test Case | Batch size | FP32 Solution (RTF) | BF16 Solution (RTF) | Performance Gain |
| --- | --- | --- | --- | --- |
| Customized PWaveNet, Customer proprietary dataset | | | | 1.89x |
RTF (lower is better) = time used to synthesize the speech / speech time

For WaveRNN, its high efficiency lets us run multiple instances (1 instance/core) for better throughput, so we report RT. PWaveNet, by contrast, has higher computing requirements that make multiple instances impractical, so we use RTF as the latency indicator (1 instance/socket).
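For reference, the two metrics used in the tables are reciprocals of each other:

```cpp
#include <cassert>
#include <cmath>

// RT (real-time throughput ratio, higher is better): how many seconds of
// speech are produced per second of compute.
double real_time_ratio(double speech_seconds, double synth_seconds) {
    return speech_seconds / synth_seconds;
}

// RTF (real-time factor, lower is better): seconds of compute needed per
// second of speech; RTF < 1 means faster than real time.
double real_time_factor(double speech_seconds, double synth_seconds) {
    return synth_seconds / speech_seconds;
}
```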


In conclusion, we applied several optimization techniques to Tencent’s customized WaveRNN-family neural vocoder and Parallel WaveNet on 3rd Gen Intel Xeon Scalable processors. These techniques helped achieve leading performance on the platform.


Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance is measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.  For more complete information visit

Performance results are based on testing as of May 11 and April 28, 2020 and may not reflect all publicly available updates. No product or component can be absolutely secure.

Refer to for more information regarding performance and optimization choices in Intel Software Products.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.