Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition

Tuesday, September 19, 2000 - 9:30am - 10:25am
Keller 3-180
Wendy Holmes (20/20 Speech Ltd.)
HMMs provide a tractable mathematical framework for training and recognition, combined with a model structure that is broadly appropriate for speech. However, it is generally acknowledged that HMMs provide a somewhat crude model of the speech production process. In particular, the assumptions of piecewise stationarity and of independence between successive observations are not appropriate for the continuously evolving, dynamic nature of speech production. This talk will begin by discussing some of the advantages and limitations of HMMs, and explaining how these limitations can be addressed by using a segmental HMM, in which states are associated with sequences of observations rather than with individual observations. Different trajectory-based models for describing signal dynamics will be described, and some experimental investigations with a particular class of segmental HMMs will be presented.
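To make the distinction concrete, here is a minimal sketch (not from the talk; all values and function names are illustrative) contrasting how a conventional HMM state and a linear-trajectory segmental HMM state score the same sequence of frames. The conventional state assumes one stationary mean with independent frames; the segmental state scores the whole segment against a trajectory, with frames conditionally independent given that trajectory.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log-density of a scalar Gaussian observation."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def segment_loglik_linear(seg, slope, intercept, var):
    """Log-likelihood of an entire segment under a linear-trajectory
    segmental HMM state: frame t is modelled as intercept + slope * t
    plus Gaussian noise (frames conditionally independent given the
    trajectory)."""
    return sum(gaussian_logpdf(y, intercept + slope * t, var)
               for t, y in enumerate(seg))

def segment_loglik_stationary(seg, mean, var):
    """The same segment scored by a conventional HMM state: a single
    stationary mean, with fully independent frames."""
    return sum(gaussian_logpdf(y, mean, var) for y in seg)

# A smoothly rising, formant-like parameter track (hypothetical values).
seg = [1.0, 1.2, 1.4, 1.6, 1.8]
mean = sum(seg) / len(seg)

ll_traj = segment_loglik_linear(seg, slope=0.2, intercept=1.0, var=0.01)
ll_stat = segment_loglik_stationary(seg, mean, var=0.01)

# The trajectory model fits the within-segment dynamics exactly,
# while the stationary state pays a large penalty for the movement.
print(ll_traj > ll_stat)  # True
```

The piecewise-stationarity limitation shows up directly: the conventional state can only explain within-segment movement as noise, whereas the trajectory state models it explicitly.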

The second part of the talk will consider the requirements for a good model for automatic speech recognition in more general terms. It will be argued that such a model needs to capture the dynamics of the underlying speech production process, and also to provide a meaningful characterization of differences between speakers, the effects of noise, and so on. These requirements suggest that the model should be expressed in terms of parameters that are closely related to speech production, such as articulatory or formant parameters. It should then be possible to develop a model of speech that can be used for a disciplined approach to speaker adaptation, as well as being directly applicable to both synthesis and recognition. To illustrate the principles of this integrated approach, formant-based trajectory segmental HMMs have been applied to recognition-synthesis speech coding.