Switching Dynamic-System Models for Speech Articulation and Acoustics

Tuesday, September 19, 2000 - 11:00am - 12:00pm
Keller 3-180
Li Deng (Microsoft Research)
A statistical generative model for the speech process is described that embeds a substantially richer structure than the HMM currently in predominant use for speech recognition. This (multi-level) switching dynamic-system model generalizes and integrates the HMM and the (piece-wise) stationary linear dynamic system (state-space) model. Depending on the level and the nature of the switching in the model design, various key properties of the speech dynamics can be naturally represented in the model. Such properties include the temporal structure of the speech acoustics, the causal articulatory movements that produce it, and the control of those movements by multidimensional targets correlated with the phonological (symbolic) units of speech, expressed in terms of overlapping articulatory features.
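As a rough illustration of the generative structure described above, the sketch below samples from a toy two-regime switching dynamic system: a discrete Markov chain plays the role of the phonological states, the continuous hidden state decays toward a regime-specific target (the articulatory level), and a noisy linear readout plays the role of the acoustics. All parameter values and names here are illustrative assumptions, not the speaker's model.

```python
import random

# Toy switching dynamic-system sampler (illustrative parameters only).
TRANS = {0: [0.9, 0.1], 1: [0.1, 0.9]}  # regime self-/cross-transition probs
TARGET = {0: -1.0, 1: 1.0}              # regime-specific articulatory targets
RATE = 0.5                              # fraction of the target gap closed per step

def sample(T, seed=0):
    """Draw T steps of (regime, hidden state, observation)."""
    rng = random.Random(seed)
    s, x = 0, 0.0
    regimes, states, obs = [], [], []
    for _ in range(T):
        # discrete (phonological) switching via the Markov chain
        s = 0 if rng.random() < TRANS[s][0] else 1
        # target-directed linear dynamics with process noise (articulation)
        x = x + RATE * (TARGET[s] - x) + rng.gauss(0.0, 0.05)
        # noisy linear observation (acoustics)
        y = 2.0 * x + rng.gauss(0.0, 0.1)
        regimes.append(s)
        states.append(x)
        obs.append(y)
    return regimes, states, obs
```

Within each regime the dynamics are a stationary linear state-space model; dropping the continuous state recovers an HMM-like structure, which is the sense in which the switching model generalizes both.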

Some simplification of this model reduces it to several models widely used in the control, econometrics, signal processing (target tracking), and neural computation literatures. One main challenge in using the multi-level switching dynamic-system model for successful speech recognition, especially for unconstrained, conversational speech recognition, is the computationally intractable inference (decoding) of the posterior probabilities of the hidden states --- both the discrete phonological states and the continuous articulatory states. This, in turn, makes exact EM-based parameter learning (training) intractable as well. Some research on approximate, computationally tractable inference and learning algorithms will be discussed in this lecture.
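The source of the intractability is that exact filtering must track one Gaussian per regime history, so the posterior over the continuous state grows as S^t mixture components after t steps. One standard family of tractable approximations (generalized pseudo-Bayes, or GPB-style collapsing) moment-matches the mixture back to a single Gaussian after every step. The scalar sketch below is a minimal, hedged illustration of that idea with made-up parameters; it is not the specific algorithm of the lecture.

```python
import math

# Per-regime state dynamics x' = A[s]*x + B[s] + noise(Q); observation
# y = C*x + noise(R); regime transition matrix T.  Values are assumptions.
A = [0.5, 0.5]
B = [-0.5, 0.5]
Q, C, R = 0.05, 2.0, 0.1
T = [[0.9, 0.1], [0.1, 0.9]]

def gauss_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def step(m, P, w, y):
    """One GPB1-style filter step: per-regime Kalman update, then collapse."""
    means, variances, weights = [], [], []
    for s in range(2):
        # Kalman predict under regime s
        mp = A[s] * m + B[s]
        Pp = A[s] ** 2 * P + Q
        # Kalman update with observation y
        S_ = C ** 2 * Pp + R          # innovation variance
        K = Pp * C / S_               # Kalman gain
        means.append(mp + K * (y - C * mp))
        variances.append((1 - K * C) * Pp)
        # regime weight: prior mass flowing into s times data likelihood
        prior_s = sum(w[j] * T[j][s] for j in range(2))
        weights.append(prior_s * gauss_pdf(y, C * mp, S_))
    Z = sum(weights)
    weights = [wt / Z for wt in weights]
    # collapse: moment-match the two-component mixture to one Gaussian
    m_new = sum(wt * mu for wt, mu in zip(weights, means))
    P_new = sum(wt * (v + (mu - m_new) ** 2)
                for wt, mu, v in zip(weights, means, variances))
    return m_new, P_new, weights
```

Because the mixture is collapsed at every step, the cost per observation is constant in t rather than exponential, at the price of an approximation error that EM-based training then inherits.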