Data-Driven Semantic Language Modeling

Monday, October 30, 2000 - 2:00pm - 3:00pm
Keller 3-180
Jerome Bellegarda (Apple Computer, Inc.)
Statistical language models used in large vocabulary speech recognition have to properly encapsulate the various constraints, both local and global, present in the language. While usual n-gram modeling can readily capture local constraints, it has been more difficult to handle global constraints, such as long-term semantic dependencies, within an adequate data-driven formalism.

This presentation focuses on the use of latent semantic analysis (LSA), a paradigm which automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a single (continuous) semantic vector space of comparatively low dimension, in which familiar clustering techniques can be applied. Applications include information retrieval, automatic semantic classification, and semantic language modeling. In the latter case, LSA leads to several model families with various smoothing properties. Because of their large-span nature, such LSA language models are well suited to complement conventional n-grams.

An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to suitably adjust the standard n-gram probability. This hybrid modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. The presentation will conclude with a discussion of intrinsic trade-offs and open issues in this emerging area of statistical language modeling.
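To make the LSA mapping concrete, the following is a minimal sketch (not the speaker's implementation) of how a truncated singular value decomposition projects words and documents into a shared low-dimensional semantic space. The toy word-document count matrix, the word list, and the choice of dimension k=2 are all illustrative assumptions, not data from the talk:

```python
import numpy as np

# Toy word-document count matrix W (rows = words, columns = documents).
# The counts below are invented for illustration only.
words = ["stock", "market", "bank", "river", "water"]
W = np.array([
    [3, 2, 0, 0],   # "stock"  appears mostly in docs 0-1
    [2, 3, 1, 0],   # "market"
    [1, 2, 1, 0],   # "bank"
    [0, 0, 2, 3],   # "river"  appears mostly in docs 2-3
    [0, 0, 3, 2],   # "water"
], dtype=float)

# LSA: truncated SVD, W ≈ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2  # dimension of the latent semantic space (illustrative choice)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Word vectors are the rows of U_k S_k; document vectors are the
# columns of S_k Vt_k. Both live in the same k-dimensional space,
# so familiar clustering / similarity techniques apply to either.
word_vecs = U_k @ S_k
doc_vecs = (S_k @ Vt_k).T

def cosine(a, b):
    """Cosine similarity between two vectors in the semantic space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that share document contexts end up close together; words
# from different topics do not.
sim_related = cosine(word_vecs[0], word_vecs[1])    # "stock" vs. "market"
sim_unrelated = cosine(word_vecs[0], word_vecs[3])  # "stock" vs. "river"
print(sim_related > sim_unrelated)
```

In the hybrid formulation mentioned in the abstract, distances of this kind in the semantic space supply the long-span information used to adjust the local n-gram probability; the exact combination rule is part of the talk and is not reproduced here.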