Team 2: Optimizing Language Models and Texts for Automatic Speech Recognition

Wednesday, July 19, 2000 - 9:25am - 9:45am
Keller 3-180
Joan Bachenko (Linguistic Technologies, Inc.)
Speech recognizers incorporate three core modules: the decoder, which performs pattern matching; the language model, which defines the vocabulary and word patterns; and the acoustic model, which defines the phone set and phone patterns. The process of recognition is essentially a series of guesses among thousands of hypotheses. The job of the language and acoustic models is to represent the hypotheses and their likelihoods in order to maximize the recognizer's chances of producing the right output for a given speech input. This workshop will focus on language modeling and on experiments in language model optimization. Training data, language model software and access to a high-quality speech recognizer will be made available to participating students.

A language model (LM) is a probabilistic model trained on text data. Most LMs today are trigram models (where each gram is a word) that back off to bigrams and unigrams and use a smoothing technique to handle sparse data. For working speech recognition systems, the LM is only as good as the text it is trained on. Hence texts are usually taken from some limited domain, e.g. airline reservations or radiology, in order to constrain the set of hypotheses that the LM makes available. If the training text is too limited, however, the LM will fail to represent n-grams that are likely to be spoken.
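To make the backoff idea concrete, here is a minimal sketch in Python. It is not the method of any particular toolkit: it uses the simple "stupid backoff" scoring scheme with a fixed penalty (the `alpha` parameter, an assumed value) rather than Katz backoff, so it returns scores rather than normalized probabilities. The function names and the toy sentence are hypothetical and chosen only for illustration.

```python
from collections import Counter

def train_counts(tokens):
    """Count unigrams, bigrams, and trigrams in a list of word tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams

def backoff_score(w1, w2, w3, counts, alpha=0.4):
    """Score w3 given the history (w1, w2), backing off when counts are missing.

    alpha is an assumed fixed backoff penalty; the result is a score,
    not a normalized probability.
    """
    unigrams, bigrams, trigrams = counts
    # Use the trigram estimate when the trigram was observed.
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    # Otherwise back off to the bigram estimate, with a penalty.
    if bigrams[(w2, w3)] > 0:
        return alpha * bigrams[(w2, w3)] / unigrams[w2]
    # Finally back off to an add-one smoothed unigram estimate,
    # so that unseen words still receive a nonzero score.
    total = sum(unigrams.values())
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen words
    return alpha * alpha * (unigrams[w3] + 1) / (total + vocab_size)

# Toy training text (hypothetical), standing in for a limited-domain corpus.
tokens = "the patient has a fever the patient has a cough".split()
counts = train_counts(tokens)

seen = backoff_score("the", "patient", "has", counts)       # trigram observed
backed_off = backoff_score("cough", "the", "patient", counts)  # bigram only
unseen = backoff_score("has", "a", "rash", counts)          # unseen word
print(seen, backed_off, unseen)
```

The last call illustrates the sparse-data problem described above: "rash" never occurs in the training text, so only the smoothed unigram estimate keeps its score above zero.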

One question we will address in the workshop is how to determine when a training text is good enough to produce a good LM. Another is how to partition a training text into minimally overlapping sublanguage texts in order to build good sublanguage models. For example, is it possible to predict whether the best LM includes both pediatrics and general medicine, or whether the recognizer will perform better with two distinct LMs? If the best approach is two LMs, how should the distance between them be measured and optimized? The workshop will focus on lexical growth and LM interpolation in an attempt to provide some answers to these questions. Lexical growth is a measure of the rate at which new words are observed in a text as the text grows in size. Predictive models of lexical growth exist and will be discussed. LM interpolation is a statistical method of weighting texts in LM construction. Although interpolation is commonly used in adapting LMs to new domains, little is known about how interpolation can be used to predict LM performance.
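Both notions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workshop's method: the models are add-one smoothed unigram models rather than trigram models, the interpolation is simple linear mixing with a weight `lam`, and the two tiny "sublanguage" texts and all function names are hypothetical.

```python
import math
from collections import Counter

def lexical_growth(tokens, step=1):
    """Return (text size, vocabulary size) pairs as the text grows."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def interpolate(model_a, model_b, lam):
    """Linear interpolation: lam * P_a(w) + (1 - lam) * P_b(w)."""
    return {w: lam * model_a[w] + (1 - lam) * model_b[w] for w in model_a}

def perplexity(model, tokens):
    """Perplexity of a unigram model on held-out tokens."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Hypothetical toy texts standing in for two sublanguages and a test set.
pediatrics = "the child has a fever the child has a rash".split()
general = "the patient has a cough the patient has chest pain".split()
held_out = "the child has a cough".split()

vocab = set(pediatrics) | set(general) | set(held_out)
p_ped = unigram_model(pediatrics, vocab)
p_gen = unigram_model(general, vocab)

# Vocabulary size after every 5 tokens of the pediatrics text.
growth = lexical_growth(pediatrics, step=5)
print(growth)

# Held-out perplexity for a few interpolation weights.
for lam in (0.0, 0.5, 1.0):
    mixed = interpolate(p_ped, p_gen, lam)
    print(lam, round(perplexity(mixed, held_out), 2))
```

A flattening growth curve suggests that the text is covering most of its domain's vocabulary, and comparing held-out perplexity across values of `lam` is one simple way to ask whether one combined LM or two separate LMs fits a given kind of input better.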



Speech at Carnegie Mellon University: You can download the CMU Statistical Language Modeling Toolkit, one of the best toolkits for building LMs for research, and read the documentation.

Speech at Cambridge University: Click on Links to Related Sites to visit other speech recognition laboratories. Cambridge is closely connected to Entropic Cambridge Research Laboratory, which produces a highly regarded speech recognizer called HTK (Hidden Markov Model Toolkit).

Center for Language and Speech Processing at Johns Hopkins University: Follow the links to workshops.

Allen, James. 1995. Natural Language Understanding. Redwood City, CA: Benjamin/Cummings Publishing. Chapter 7 and Appendix C.

Jurafsky, Dan and James H. Martin. 2000. Speech and Language Processing. Upper Saddle River, NJ: Prentice-Hall. Chapters 5, 6, 7.