by Ara V. Nefian, Lu Hong Liang, Xiao Xing Liu, Xiaobo Pi
The growing number of multimedia applications that require robust speech recognition has generated considerable interest in audio-visual speech recognition (AVSR) systems. Using visual features in AVSR is justified both by the bimodal (audio and visual) nature of speech production and by the need for features that are invariant to acoustic noise. The speaker-independent audio-visual continuous speech recognition system described here relies on a robust set of visual features obtained by accurately detecting and tracking the mouth region. The visual and acoustic observation sequences are then integrated using a coupled hidden Markov model (CHMM), shown in Figure 1. The statistical properties of the CHMM allow it to model the asynchrony between the audio and visual states while preserving their natural correlation over time. Experimental results show that the current system, tested on the XM2VTS database (295 speakers), reduces the error rate of the audio-only speech recognition system by over 55% at an SNR of 0 dB (Figure 2).
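To make the coupling concrete, the following is a minimal sketch of the forward (likelihood) recursion for a two-chain coupled HMM, in which the next state of each chain is conditioned on the previous states of both chains. It is an illustration under stated assumptions, not the system's implementation: the discrete emission tables, the function name `chmm_forward`, and all toy parameter values are hypothetical, and the observation model and training procedure of the actual system are not described here.

```python
from itertools import product


def chmm_forward(pi_a, pi_v, trans_a, trans_v, emit_a, emit_v, obs_a, obs_v):
    """Likelihood of paired audio/visual symbol sequences under a toy
    two-chain coupled HMM with discrete emissions (an assumption).

    trans_a[i][j][k] = P(a_t = k | a_{t-1} = i, v_{t-1} = j)
    trans_v[i][j][l] = P(v_t = l | a_{t-1} = i, v_{t-1} = j)
    emit_a[k][o]     = P(audio symbol o  | a_t = k)
    emit_v[l][o]     = P(visual symbol o | v_t = l)
    """
    states = list(product(range(len(pi_a)), range(len(pi_v))))

    # Initialisation: the two chains start independently of each other.
    alpha = {
        (k, l): pi_a[k] * emit_a[k][obs_a[0]] * pi_v[l] * emit_v[l][obs_v[0]]
        for k, l in states
    }

    for oa, ov in zip(obs_a[1:], obs_v[1:]):
        new_alpha = {}
        for k, l in states:
            # Coupling: each chain's new state depends on BOTH previous states,
            # so the chains may drift apart (asynchrony) yet stay correlated.
            incoming = sum(
                alpha[i, j] * trans_a[i][j][k] * trans_v[i][j][l]
                for i, j in states
            )
            new_alpha[k, l] = incoming * emit_a[k][oa] * emit_v[l][ov]
        alpha = new_alpha

    return sum(alpha.values())


# Hypothetical toy parameters: 2 states and 2 symbols per stream.
PI_A, PI_V = [0.6, 0.4], [0.5, 0.5]
TRANS_A = [[[0.7, 0.3], [0.4, 0.6]], [[0.2, 0.8], [0.1, 0.9]]]
TRANS_V = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.1, 0.9]]]
EMIT_A = [[0.9, 0.1], [0.2, 0.8]]
EMIT_V = [[0.7, 0.3], [0.4, 0.6]]

likelihood = chmm_forward(PI_A, PI_V, TRANS_A, TRANS_V, EMIT_A, EMIT_V,
                          [0, 1, 1], [0, 0, 1])
print(likelihood)
```

The joint state space here is the product of the two chains' states, which keeps the sketch short but grows quickly with the number of states per chain; a practical recognizer would also work in the log domain to avoid underflow on long sequences.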
Figure 1. A coupled HMM used in audio-visual integration.

Figure 2. The word error rate (WER) at different signal-to-noise ratio (SNR) levels for audio-only, video-only, and audio-visual speech recognition.

Figure 3. Speech recognition examples for an audio-visual sequence captured in clean (top) and noisy (bottom, SNR = 5 dB) acoustic conditions.

Ara V. Nefian, Lu Hong Liang, Xiao Xing Liu, Xiaobo Pi, and Kevin Murphy, "Dynamic Bayesian networks for audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1274-1288, 2002.
Xiao Xing Liu, Yibao Zhao, Xiaobo Pi, Lu Hong Liang, and Ara V. Nefian, "Audio-visual continuous speech recognition using a coupled hidden Markov model", IEEE International Conference on Spoken Language Processing, pp. 213-216, September 2002.
Lu Hong Liang, Xiao Xing Liu, Yibao Zhao, Xiaobo Pi, and Ara V. Nefian, "Speaker independent audio-visual continuous speech recognition", IEEE International Conference on Multimedia and Expo, vol. 2, pp. 25-28, August 2002.
Ara V. Nefian, Lu Hong Liang, Xiao Xing Liu, Xiaobo Pi, Crusoe Mao, and Kevin Murphy, "A coupled HMM for audio-visual speech recognition", International Conference on Acoustics, Speech and Signal Processing, vol. II, pp. 2013-2016, Orlando, Florida, May 2002.