# Speech Recognition Using Deep Learning on Intel® Architecture

Published: 04/17/2018

Last Updated: 04/16/2018

## Abstract

This paper demonstrates how to train and run inference for a speech recognition model using deep neural networks on Intel® architecture. The model was trained from scratch on the Speech Commands dataset that TensorFlow* recently released. Inference was then run on test audio clips to detect the spoken label. The experiments were run on an Intel® Xeon® Gold processor system.

## Introduction

Audio classification tasks fall into three subdomains: music classification, speech recognition (particularly the acoustic model), and acoustic scene classification. With the rapid development of mobile devices, speech-related technologies are becoming increasingly popular. For example, Google offers the ability to search by voice on Android* phones. In this study, we approach the speech recognition problem by building a basic speech recognition network that recognizes thirty different words, using a TensorFlow-based implementation.

To help with this experiment, TensorFlow recently released the Speech Commands dataset. It includes 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people.

Continued research in the deep learning space has produced many frameworks for the complex problem of speech recognition. These frameworks have been optimized for the hardware they run on, yielding better accuracy, reduced loss, and increased speed. Along these lines, Intel has optimized the TensorFlow library for better performance on Intel® Xeon® processors. This paper discusses training and inference for a speech recognition model built with a sample convolutional neural network (CNN) architecture using the TensorFlow framework on a cluster powered by Intel® processors. We trained the model from scratch.

## Document Content

This section describes the end-to-end steps, from choosing the environment to running the tests on the trained speech recognition model.

### Choosing the environment

#### Hardware

Experiments were performed on Intel Xeon Gold processor-powered systems. Table 1 lists the hardware details.

Table 1. Intel Xeon Gold processor configuration.

| Component | Details |
| --- | --- |
| Architecture | x86_64 |
| CPU op-mode(s) | 32-bit, 64-bit |
| Byte order | Little endian |
| CPU(s) | 24 |
| Core(s) per socket | 6 |
| Socket(s) | 2 |
| CPU family | 6 |
| Model | 85 |
| Model name | Intel Xeon Gold 6128 CPU @ 3.40 GHz |
| RAM | 92 GB |

#### Software

Intel® Optimization for TensorFlow* framework, along with Intel® Distribution for Python*, was used as the software configuration. Table 2 lists the details of the software.

Table 2. Software configuration – Intel Xeon Gold processor

| Software | Version |
| --- | --- |
| TensorFlow (optimized by Intel) | 1.4.0 |
| Python* | 3.6 |
| TensorBoard* | 0.1.5 |

The software configurations listed in Table 2 were already available in the chosen hardware environments, so no source build of TensorFlow was necessary.

### Dataset

The Speech Commands dataset (a TAR archive of more than 1 GB) comprises 65,000 WAVE audio files (.wav) of people saying 30 different words. This data was collected by Google and released under a CC BY license. Each audio file is a one-second clip labeled as silence, an unknown word, or one of the ten core words: yes, no, up, down, left, right, on, off, stop, or go. These twelve classes, out of the 30 words in the full dataset, were used for this experiment.

Total training dataset: 23,701 files

- Training: 80 percent (18,961)
- Validation: 10 percent (2,370)
- Testing: 10 percent (2,370)

We used a hash-function-based split to prevent repeating the files from one set to another.
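A hash-based split can be sketched as follows, modeled on the `which_set` helper in TensorFlow's `speech_commands/input_data.py`. The `_nohash_` suffix convention comes from the dataset's file naming, which groups clips from the same speaker; treat the exact constants as an illustration:

```python
import hashlib
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # large modulus for a stable hash bucket

def which_set(filename, validation_percentage=10, testing_percentage=10):
    """Deterministically assign a file to training, validation, or testing."""
    base_name = filename.split('/')[-1]
    # Strip the "_nohash_N" suffix so all clips from one speaker land in the
    # same set, preventing leakage between training and testing.
    hash_name = re.sub(r'_nohash_.*$', '', base_name)
    digest = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = (int(digest, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) * (
        100.0 / MAX_NUM_WAVS_PER_CLASS)
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < validation_percentage + testing_percentage:
        return 'testing'
    return 'training'
```

Because the assignment depends only on the hashed speaker prefix, re-running the split (or adding new files) never moves an existing speaker's clips between sets.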

We maintained a list of all words, such as up, go, off, on, stop, and so on. The train/test split was done per word to ensure all classes were covered and there was no class imbalance.

### Network architecture

The architecture used is based on the paper *Convolutional Neural Networks for Small-footprint Keyword Spotting*. TensorFlow provides different approaches to building neural network models. We chose CNN-TRAD-POOL3 because it is comparatively simple, quick to train, and easy to understand. The CNN-TRAD-POOL3 network consists of two convolution layers with max pooling, one linear low-rank layer, one DNN layer, and one softmax layer. Figure 1 shows the CNN-TRAD-POOL3 architecture.
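As a rough sketch of how tensor shapes flow through such a network, the following computes the output sizes for a conv–pool–conv stack. The input size (a 98-frame by 40-feature spectrogram) and the use of 'SAME' padding with 2x2 pooling follow the `conv` model in TensorFlow's speech_commands example and are assumptions, not figures taken from this experiment:

```python
import math

def same_conv(h, w, filters):
    # 'SAME' padding with stride 1: spatial dimensions are unchanged,
    # only the channel count becomes the number of filters.
    return h, w, filters

def max_pool(h, w, c, stride=2):
    # 'SAME'-padded pooling halves each spatial dimension, rounding up.
    return math.ceil(h / stride), math.ceil(w / stride), c

# Assumed input: 98 time frames x 40 spectral features, 1 channel.
h, w, c = same_conv(98, 40, 64)   # first conv layer, 64 filters
h, w, c = max_pool(h, w, c)       # 2x2 max pooling
h, w, c = same_conv(h, w, 64)     # second conv layer, 64 filters
flat = h * w * c                  # flattened input to the dense layers
print((h, w, c), flat)            # (49, 20, 64) 62720
```

This kind of shape bookkeeping is useful for sizing the low-rank and dense layers that follow the convolutional stack.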

### Execution steps

This section describes the steps we used in the end-to-end process for training, validation, and testing the speech recognition model on Intel® architecture.

1. Setup for training
2. Model training
3. Inference

### Setup for training

1. Install Intel® Optimization for TensorFlow*.
2. Clone the TensorFlow repository from https://github.com/tensorflow/tensorflow.

### Model training

After cloning the TensorFlow repository, the next step is to train the model. We trained all layers of the model from scratch rather than starting from a pretrained checkpoint.

The following command downloads the Speech Commands dataset and trains the model on it:

```
python tensorflow/examples/speech_commands/train.py
```

#### Experimental runs with inference

**On the Intel Xeon Gold processor (Intel® AI DevCloud cluster)**

To execute on the Intel AI DevCloud cluster, use the following command to submit the training job:

```
qsub speech.sh -l walltime=24:00:00
```

On this cluster, the default walltime for a job is six hours; the maximum that can be requested is 24 hours. As shown in the qsub command, the walltime is set to 24 hours.

The job script speech.sh has the following code:

```
#!/bin/sh
#PBS -l walltime=24:00:00
which python
cd ~/tensorflow/
export PATH=/glob/intel-python/python3/bin/:$PATH
numactl --interleave=all python ~/tensorflow/tensorflow/examples/speech_commands/train.py
```


#### TensorBoard* Graphs

TensorBoard is an effective tool for visualizing training progress. By default, the script saves events to /tmp/retrain_logs; load them by running the following command:

```
tensorboard --logdir /tmp/retrain_logs
```

Figure 2 shows the TensorBoard graphs for the Intel Xeon Gold processor.

Figure 2. TensorBoard graphs - Intel Xeon Gold processor.

The script used to export the trained model file for inference is as follows:

```
echo python ~/tensorflow/tensorflow/examples/speech_commands/freeze.py --start_checkpoint=~/kaggle-speech/speech_commands_train/conv.ckpt-68000 --output_file=~/kaggle-speech/my_frozen_graph_68000.pb | qsub
```

After the frozen model has been created, test it with the label_wav.py script:

```
echo python ~/tensorflow/tensorflow/examples/speech_commands/label_wav.py --graph=~/kaggle-speech/my_frozen_graph_68000.pb --labels=~/kaggle-speech/speech_commands_train/conv_labels.txt --wav=~/kaggle-speech/speech_dataset/left/a5d485dc_nohash_0.wav | qsub
```

The script prints the top three scores:

```
left (score = 0.96563)
right (score = 0.02616)
_unknown_ (score = 0.00717)
```

left has the top score because it is the correct label.
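The ranked scores printed by label_wav.py come from a softmax over the network's outputs. The ranking step can be illustrated in isolation as follows; the logit values here are made up for demonstration:

```python
import math

def top_k_scores(logits, labels, k=3):
    """Convert raw network outputs to probabilities and return the k best labels."""
    # Shift by the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    ranked = sorted(zip(labels, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

labels = ['_silence_', '_unknown_', 'left', 'right']
print(top_k_scores([0.1, 1.2, 4.5, 1.9], labels))  # 'left' ranks first
```

Because softmax normalizes the outputs, the scores across all labels always sum to one, which is why a confident prediction like 0.96563 leaves very little probability mass for the remaining labels.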

#### Intel® Xeon® Gold processor metrics

Table 3. Intel Xeon Gold processor metrics.

| Property | Intel Xeon Gold processor |
| --- | --- |
| Total amount of time | 83,400 seconds |
| Total number of steps | 68,000 |
| Batch size | 100 |
| Total WAV files | 6,800,000 |
| WAV files per second (total WAV files / total time) | 81.53 |
| Training accuracy | 93 percent |
| Validation accuracy | 92 percent |
| Testing accuracy | 92.5 percent |
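The throughput figure in Table 3 follows directly from the step count and batch size; a quick check of the arithmetic:

```python
steps = 68_000
batch_size = 100
total_seconds = 83_400

total_wavs = steps * batch_size               # clips processed over the run
wavs_per_second = total_wavs / total_seconds  # throughput
print(total_wavs, round(wavs_per_second, 2))  # 6800000 81.53
```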

## Conclusion

In this paper, we showed how we trained and tested speech recognition from scratch using a sample CNN model and the TensorFlow audio recognition dataset on Intel Xeon Gold processor-based environments. The experiment can be extended by applying different optimization algorithms, changing learning rates, and varying input sizes to further improve accuracy.

Rajeswari Ponnuru and Ravi Keron Nidamarty are members of the Intel team, working on evangelizing artificial intelligence in the academic environment.

## References

- Kaggle's TensorFlow speech recognition challenge
- TensorFlow audio recognition tutorial

## Related Resources

- TensorFlow* Optimizations on Modern Intel® Architecture
- Build and Install TensorFlow* on Intel® Architecture



#### Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.