Robust Signal Representations for Automatic Speech Recognition

Wednesday, September 20, 2000 - 11:00am - 11:55am
Keller 3-180
Richard Stern (Carnegie Mellon University)
As speech recognition technology is transferred from the laboratory to the marketplace, robust speech recognition is becoming increasingly important. This talk will review and discuss classical and contemporary approaches to robust speech recognition. We begin by reviewing the role of cepstral analysis in speech recognition, as implemented by mel-frequency cepstral coefficients and by perceptual linear prediction, along with the contributions of cepstral differences to feature vectors. The most tractable types of environmental degradation are produced by quasi-stationary additive noise and quasi-stationary linear filtering. These distortions can be largely ameliorated by the classical techniques of cepstral high-pass filtering (as exemplified by cepstral mean normalization and RASTA filtering), as well as by techniques that develop statistical models of the distortion, such as codeword-dependent cepstral normalization (CDCN) and vector Taylor series (VTS) compensation. Nevertheless, these approaches provide little useful improvement when speech is degraded by transient or non-stationary interference such as background music or competing speech. We describe and compare the effectiveness of techniques based on missing-feature compensation and multi-band analysis in addressing these problems. We briefly review the literature on signal processing based on models of the auditory system and comment on its effectiveness in achieving robustness to date. Finally, we briefly summarize some recent work in which optimal feature sets for particular tasks are developed by nonlinear transformations selected to maximize the likelihood of a particular set of training data.
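
As an illustration of why cepstral mean normalization helps against quasi-stationary linear filtering, here is a minimal sketch; it is not taken from the talk. It relies on the standard observation that a fixed linear channel multiplies the spectrum and therefore adds a roughly constant offset to every cepstral frame, so subtracting the per-utterance mean removes that offset. The function name `cepstral_mean_normalize`, the toy feature values, and the `channel_offset` vector are all assumptions made up for the example.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean taken over all frames of one utterance."""
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across time.
    means = [sum(frame[k] for frame in frames) / num_frames for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


# Toy "cepstra": 3 frames of 2 coefficients (made-up numbers, not real speech).
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# Assumption: a stationary linear channel appears as a constant additive
# offset in the cepstral domain (hypothetical values).
channel_offset = [0.5, -1.5]
filtered = [[c + h for c, h in zip(frame, channel_offset)] for frame in clean]

normalized_clean = cepstral_mean_normalize(clean)
normalized_filtered = cepstral_mean_normalize(filtered)
```

After normalization the clean and filtered versions coincide, which is the sense in which the channel is removed. The same subtraction does nothing useful for an interferer whose contribution changes from frame to frame, which is consistent with the abstract's point about non-stationary noise.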