
Material from Talks

Jerome R. Bellegarda (Spoken Language Group, Apple Computer), jerome@apple.com

Data-Driven Semantic Language Modeling

Statistical language models used in large vocabulary speech recognition have to properly encapsulate the various constraints, both local and global, present in the language. While usual n-gram modeling can readily capture local constraints, it has been more difficult to handle global constraints, like long-term semantic dependencies, within an adequate data-driven formalism. This presentation focuses on the use of latent semantic analysis (LSA), a paradigm which automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a single (continuous) semantic vector space of comparatively low dimension, in which familiar clustering techniques can be applied. Applications include information retrieval, automatic semantic classification, and semantic language modeling. In the latter case, it leads to several model families with various smoothing properties. Because of their large-span nature, such LSA language models are well suited to complement conventional n-grams. An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to suitably adjust the standard n-gram probability. This hybrid modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. The presentation will conclude with a discussion of intrinsic trade-offs and open issues in this emerging area of statistical language modeling.
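
The word-document mapping described above can be sketched with a truncated SVD of a term-document count matrix; the toy corpus, dimension, and raw counts below are illustrative (real LSA systems use entropy-weighted counts over large corpora):

```python
import numpy as np

# Hypothetical toy corpus: rows = words, columns = documents,
# entries = occurrence counts (unweighted, for simplicity).
words = ["stock", "market", "trade", "ball", "team", "game"]
A = np.array([
    [3, 2, 0],   # stock
    [2, 3, 0],   # market
    [1, 2, 0],   # trade
    [0, 0, 3],   # ball
    [0, 1, 2],   # team
    [0, 0, 2],   # game
], dtype=float)

# Truncated SVD A ~ U_k S_k V_k^T maps words (rows of U_k S_k) and
# documents (rows of V_k S_k) into one k-dimensional semantic space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]
doc_vecs = Vt[:k, :].T * s[:k]

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words sharing document contexts cluster together in the latent space.
for i, j in [(0, 1), (0, 3)]:
    print(words[i], words[j], round(cos(word_vecs[i], word_vecs[j]), 2))
```

Clustering and similarity queries then reduce to geometry (here, cosine similarity) in the low-dimensional space.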

Julian Besag (Department of Statistics, University of Washington)   julian@stat.washington.edu

Markov Chain Monte Carlo and Bayesian Computation

Markov chain Monte Carlo (MCMC) refers to a collection of methods for closely approximating integrals with respect to awkward, often very high-dimensional, probability distributions. The basic idea is to design and run a Markov chain whose limiting distribution is the distribution of interest and to estimate the required integrals via the corresponding sample averages. The original Metropolis (1953) construction, generalized by Hastings (1970), was used extensively in the analysis of interacting particle systems such as Ising and Potts models. Subsequent developments included cluster and multigrid algorithms in the 1980's and simulated tempering and coupling from the past in the 1990's. The 1980's also saw the introduction of MCMC for simulating hidden Markov random fields (spatial generalizations of hidden Markov chains) in image analysis and this inspired other statisticians, particularly those working in the Bayesian paradigm, where general implementation had been frustrated by the need to evaluate high-dimensional integrals. Indeed, MCMC has now become the standard computational engine for Bayesian analysis. Although Hastings constructions, especially the Gibbs sampler (aka the heat bath algorithm), still predominate, there have also been some useful advances in the design of algorithms, such as reversible jumps. The talk provides an introduction to MCMC for Bayesian computation. Some notes (65pp), whose coverage is not restricted to Bayesian inference, are available at

http://www.csss.washington.edu/Papers/ (working paper no. 9)

These notes are not comprehensive, though they include some more advanced topics and at least provide useful further references! See also the MCMC website for several hundreds of papers: http://www.statslab.cam.ac.uk/~mcmc/
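
As a minimal illustration of the basic idea, the sketch below runs a random-walk Metropolis chain against a simple posterior (a normal mean under a flat prior); the data and proposal scale are invented for the example:

```python
import math
import random

random.seed(0)

# Hypothetical target: posterior of a normal mean theta under a flat prior,
# given data y_i ~ N(theta, 1), so log pi(theta) = -0.5 * sum (y_i - theta)^2
# up to an additive constant.
y = [1.2, 0.8, 1.5, 0.9, 1.1]

def log_target(theta):
    return -0.5 * sum((yi - theta) ** 2 for yi in y)

theta, samples = 0.0, []
for _ in range(20000):
    prop = theta + random.gauss(0.0, 0.5)   # symmetric random-walk proposal
    # Metropolis acceptance: accept with probability min(1, pi(prop)/pi(theta))
    if math.log(random.random()) < log_target(prop) - log_target(theta):
        theta = prop
    samples.append(theta)

burned = samples[5000:]                      # discard burn-in
post_mean = sum(burned) / len(burned)
print(round(post_mean, 2))                   # should be close to the data mean, 1.1
```

The required integral (here the posterior mean) is estimated by the sample average over the chain, exactly as described above.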

Olivier Catoni (Université Paris 6) catoni@ccr.jussieu.fr

Adaptive N-Grams through the Gibbs Aggregation Rule

We will present a general pseudo-Bayesian rule for aggregating estimators and show that it satisfies a non-asymptotic oracle inequality. We will explain how to apply it to adaptive histogram estimation and show the connection with the estimation of "adaptive" n-gram frequencies for text analysis. Most of the material we plan to explain can be found in written form in the following communication:

O. Catoni, "Data compression and adaptive histograms," International Conference on Foundations of Computational Mathematics in Honor of Professor Steve Smale's 70th Birthday, July 13-17, 2000.

It can be downloaded, together with demonstration software, from the website


or from my homepage


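The flavor of such an aggregation rule can be sketched with plain exponential weights: each candidate estimator is weighted by exp(-beta x cumulative loss), so the aggregate concentrates on the best performer without knowing it in advance. The constant estimators, squared loss, and temperature below are illustrative, not the construction analyzed in the paper:

```python
import math

def aggregate(predictions, weights):
    # Weighted average of the candidates' predictions (the "Gibbs" mixture).
    z = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / z

beta = 2.0                               # inverse temperature (illustrative)
experts = [0.0, 0.5, 1.0]                # three hypothetical constant estimators
weights = [1.0, 1.0, 1.0]

data = [0.9, 1.1, 1.0, 0.8, 1.2] * 10    # observations centered near 1.0
for obs in data:
    # Multiply each weight by exp(-beta * squared loss) after each observation.
    weights = [w * math.exp(-beta * (p - obs) ** 2)
               for w, p in zip(weights, experts)]

print(aggregate(experts, weights))       # dominated by the estimator at 1.0
```

The oracle inequality quantifies how little this mixture loses relative to the single best candidate in the list.
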
Michael Collins (AT&T Labs-Research)  mcollins@research.att.com

Statistical Models for Natural Language Parsing

This talk will discuss the problem of machine learning applied to natural language parsing: given a set of example sentence/tree pairs, the task is to learn a function from sentences to trees which generalizes well to new sentences.

In the first part of the talk I will review recent work on probabilistic, history-based approaches. Much of the recent success of these methods has been due to the incorporation of lexically conditioned parameters. I will discuss the importance of head words and dependency parameters, and also the use of estimation methods such as decision trees or maximum entropy methods.

While history-based models have several advantages, it can be awkward to encode some constraints within this framework. It is often easy to think of features which might be useful in discriminating between candidate trees for a sentence, but much more difficult to alter the model to take these features into account. In the second part of the talk I will review more recent work on learning methods which promise to be considerably more flexible in incorporating features. I will discuss how three such approaches -- boosting, support vector machines and Markov random fields -- can be applied to parsing, and the similarities and relative advantages of the three approaches.
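
The point about flexible features can be illustrated with a minimal linear reranker over candidate trees: any feature computable on a tree can enter the score. The features and weights below are invented for the example; the methods in the talk learn such weights by boosting, SVMs, or Markov random field estimation:

```python
# score(tree) = w . f(tree): a linear model over arbitrary tree features.
def score(features, weights):
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

# Invented weights and features for the classic PP-attachment ambiguity in
# "I saw the man with the telescope".
weights = {"head_dep:saw->man": 1.2, "head_dep:saw->telescope": 0.7,
           "attach:pp_to_verb": 0.4, "attach:pp_to_noun": -0.1}

candidates = [
    ("NP attachment", {"head_dep:saw->man": 1, "attach:pp_to_noun": 1}),
    ("VP attachment", {"head_dep:saw->man": 1,
                       "head_dep:saw->telescope": 1, "attach:pp_to_verb": 1}),
]

best = max(candidates, key=lambda c: score(c[1], weights))
print(best[0])  # VP attachment scores 2.3 vs 1.1
```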

Frederick Jelinek (Department of Electrical and Computer Engineering, Johns Hopkins University)  jelinek@jhu.edu

Parser-Driven Language Modeling

The talk will describe two current approaches to parser-driven language modeling: a bottom-up approach due to Chelba and Jelinek (the Structured Language Model, or SLM) and a top-down approach due to Brian Roark.

Both methods operate with the help of a stack from which less probable sub-parse entries are purged before further words are generated. However, for the SLM it is possible to generalize the CKY chart parsing algorithm to obtain a chart that allows the direct computation of language model probabilities, rendering the stack unnecessary.

An analysis of the behavior of the SLM leads to a generalization of the Inside-Outside algorithm, and thus to rigorous EM-type re-estimation of the SLM parameters. The chart parsing algorithms seem computationally expensive, but their demands can be mitigated by appropriate thresholding.
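
As background for the chart computation, the sketch below fills a CKY-style chart with inside probabilities for a tiny PCFG in Chomsky normal form; the grammar is illustrative, not the SLM itself:

```python
from collections import defaultdict

# Hypothetical PCFG in Chomsky normal form: binary rules A -> B C and
# lexical rules A -> word, each with a probability.
binary = {("S", ("NP", "VP")): 1.0,
          ("VP", ("V", "NP")): 1.0}
lexical = {("NP", "she"): 0.5, ("NP", "fish"): 0.5,
           ("V", "eats"): 1.0}

def inside_prob(sentence):
    # chart[i][j][A] = inside probability that A derives words i..j;
    # chart[0][n-1]["S"] is then the language model probability of the sentence.
    n = len(sentence)
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(sentence):
        for (A, word), p in lexical.items():
            if word == w:
                chart[i][i][A] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                    # split point
                for (A, (B, C)), p in binary.items():
                    chart[i][j][A] += p * chart[i][k][B] * chart[k + 1][j][C]
    return chart[0][n - 1]["S"]

print(inside_prob(["she", "eats", "fish"]))  # 0.5 * 1.0 * 1.0 * 1.0 * 0.5 = 0.25
```

Summing over split points, as in the inner loop, is what makes stack-based pruning unnecessary: every sub-parse contributes to the total probability.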

Mark Johnson (Department of Cognitive and Linguistic Sciences, Brown University)  mj@lx.cog.brown.edu

An Introduction to Probabilistic Grammars and their Applications   pdf    postscript

This talk surveys the roles of grammars in computational linguistics, and gives a high-level description of HMMs, PCFGs and stochastic unification grammars. We also describe the problems of estimation from both visible and hidden data.
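
As an example of the probability computations these models support, the sketch below runs the forward algorithm on a toy HMM (a hypothetical two-tag example), summing over all hidden state paths:

```python
# Toy HMM with two hidden tags; all numbers are invented for illustration.
states = ["Det", "Noun"]
start = {"Det": 0.8, "Noun": 0.2}
trans = {"Det": {"Det": 0.1, "Noun": 0.9},
         "Noun": {"Det": 0.4, "Noun": 0.6}}
emit = {"Det": {"the": 0.9, "dog": 0.1},
        "Noun": {"the": 0.1, "dog": 0.9}}

def forward(obs):
    # alpha[s] = total probability of the prefix ending in state s;
    # the recursion sums over all hidden paths in O(n * |states|^2) time.
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

print(forward(["the", "dog"]))  # 0.602
```

With hidden data (unknown tags), this same quantity is the likelihood that EM-style estimation maximizes.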

Sanjeev Khudanpur (Johns Hopkins University)

Maximum Entropy Techniques and Exponential Models in SLM/SCL

Maximum entropy methods provide elegant means for estimating complex probabilistic models from sparse data. They have been applied to various problems encountered in the processing of natural language, including statistical language modeling, part-of-speech tagging, parsing, text segmentation and classification, and machine translation. This presentation will review some of these applications. It will be argued that the main issues in using maximum entropy techniques are (i) the selection of features, or linear constraints, which the model should be required to satisfy, and (ii) the enormous computation involved in obtaining model parameters given these constraints. Recent proposals in the literature for addressing these two problems will be mentioned, and some open problems will be brought up for discussion.
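
A minimal sketch of such an exponential model: binary features paired with labels, fit by gradient ascent on the log-likelihood so that expected feature counts approach the empirical counts. The data, features, step size, and iteration count are all illustrative:

```python
import math

def features(x, y):
    # Each (observation feature, label) pair is one binary model feature.
    return {(f, y): 1.0 for f in x}

# Invented tagging-flavored training data: feature sets paired with labels.
data = [({"suffix=ing"}, "VERB"), ({"suffix=ing"}, "VERB"),
        ({"suffix=ed"}, "VERB"), ({"capitalized"}, "NOUN")]
labels = ["VERB", "NOUN"]
w = {}

def prob(x, y):
    # p(y|x) proportional to exp(sum of weights of active features).
    def s(yy):
        return sum(w.get(f, 0.0) for f in features(x, yy))
    z = sum(math.exp(s(yy)) for yy in labels)
    return math.exp(s(y)) / z

for _ in range(500):                                # gradient ascent, step 0.1
    grad = {}
    for x, y in data:
        for f in features(x, y):
            grad[f] = grad.get(f, 0.0) + 1.0        # empirical feature count
        for yy in labels:
            p = prob(x, yy)
            for f in features(x, yy):
                grad[f] = grad.get(f, 0.0) - p      # model expectation
    for f, g in grad.items():
        w[f] = w.get(f, 0.0) + 0.1 * g

print(prob({"suffix=ing"}, "VERB"))
```

Even at this toy scale, the cost is dominated by computing the model expectations; point (ii) above is this computation blown up to realistic feature sets and vocabularies.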

Christopher Manning (Stanford University)   manning@cs.stanford.edu

Probabilistic Models in Computational Linguistics

This talk will attempt to cover (very briefly!): what the aims and goals of computational linguistics are; why there is a need for statistical and quantitative techniques in linguistics and computational linguistics, including connections between psycholinguistic work and computational models; the distinctive features of natural language from the perspective of statistical models; what the major practical problems in natural language understanding and generation are; an overview of the methods that have been used to tackle those problems; and a few comments on some of the things that we don't have much idea how to do.

Mehryar Mohri (AT&T Labs-Research)   mohri@research.att.com

Finite-State Language Modeling

Weighted automata and transducers are successfully used in many text and speech processing applications. They give a unifying framework for the construction and representation of the components of speech recognition, speech synthesis and spoken-dialogue systems. This representation provides opportunities for the application of general and well-studied algorithms. We give a brief introduction to the theory of weighted transducers and describe some of these algorithms.

The automata used in language modeling for speech synthesis, speech understanding and speech recognition represent regular languages. Recent work in the theory of weighted automata shows that the same weighted automata as those currently used in speech processing can be used to recognize efficiently non-trivial classes of context-free languages. We present some results of this work, and illustrate the efficient recognition of some well-known context-free languages.
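
A minimal sketch of the weighted-automaton machinery over the tropical semiring (path weights combined by addition, alternatives resolved by minimum), with a hypothetical two-state automaton:

```python
INF = float("inf")

# transitions[state][symbol] -> list of (next_state, weight); weights are
# costs, e.g. negative log probabilities in a speech lattice. The automaton
# below is an invented two-state example.
transitions = {
    0: {"a": [(0, 1.0), (1, 0.5)], "b": [(0, 2.0)]},
    1: {"b": [(1, 0.3)]},
}
initial, final = 0, {1: 0.0}          # final-state weights

def string_weight(s):
    # Best (minimum-cost) accepting path for s, or inf if s is rejected.
    costs = {initial: 0.0}
    for sym in s:
        nxt = {}
        for q, c in costs.items():
            for r, wt in transitions.get(q, {}).get(sym, []):
                nxt[r] = min(nxt.get(r, INF), c + wt)
        costs = nxt
    return min((c + final[q] for q, c in costs.items() if q in final),
               default=INF)

print(string_weight("ab"))   # 0.5 + 0.3 + 0.0 = 0.8
print(string_weight("b"))    # rejected: inf
```

The general algorithms mentioned above (composition, determinization, minimization, shortest path) operate on exactly this kind of structure, parameterized by the semiring.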

Roni Rosenfeld (School of Computer Science, Carnegie Mellon University)  Roni_Rosenfeld@alf11.speech.cs.cmu.edu

An Accelerated Introduction to Statistical Language Modeling

Statistical Language Models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. I will review those, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data.

The talk will be loosely based on the survey paper:

"Two decades of Statistical Language Modeling: Where Do We Go From Here?", Roni Rosenfeld, Proceedings of the IEEE 88(8), August 2000  pdf (98KB)    postscript (113KB).

that was mailed to the workshop participants ahead of the workshop. However, the talk will be considerably more technical, and will be specifically designed to rapidly introduce statisticians and mathematicians to the field.
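
The baseline against which the surveyed improvements are measured is the smoothed n-gram model; a minimal sketch with bigrams and add-one smoothing follows (the toy corpus is invented, and add-one is the crudest of the many smoothing schemes the survey covers):

```python
from collections import Counter

# Toy corpus; real models are trained on millions to billions of words.
corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    # P(word | prev) with add-one (Laplace) smoothing over the vocabulary,
    # so unseen bigrams still receive nonzero probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(p("cat", "the"))   # (2 + 1) / (3 + 6) = 1/3
print(p("mat", "cat"))   # unseen bigram: (0 + 1) / (2 + 6) = 1/8
```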

Jean-Philippe Vert (École Normale Supérieure de Paris)    http://www.dma.ens.fr/users/vert/   Jean-Philippe.Vert@ens.fr

Adaptive Context Tree and Text Categorization

In this talk I present an adaptive algorithm for estimating the conditional probability of a letter knowing the past and give an "oracle" inequality for its conditional Kullback-Leibler risk. I argue that such adaptive estimators can be used to represent texts and show experimental results of unsupervised and supervised text categorization based on this representation.
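
The conditional letter probabilities can be sketched with an online fixed-context estimator using Krichevsky-Trofimov (add-1/2) smoothing; the adaptive selection of the context length, which is the point of the talk, is not shown:

```python
import math
from collections import defaultdict

alphabet = "ab"                      # toy alphabet
k = 1                                # fixed context length (not adaptive here)
counts = defaultdict(lambda: defaultdict(float))

def predict(context, letter):
    # KT estimate: (count + 1/2) / (total + |alphabet|/2).
    ctx = counts[context]
    total = sum(ctx.values())
    return (ctx[letter] + 0.5) / (total + 0.5 * len(alphabet))

text = "ababababab"
logloss = 0.0                        # accumulated code length in bits
for i in range(k, len(text)):
    context, letter = text[i - k:i], text[i]
    logloss -= math.log2(predict(context, letter))
    counts[context][letter] += 1     # sequential (online) update

print(round(logloss, 2))             # well below the 9 bits of a uniform coder
print(predict("a", "b"))             # (5 + 0.5) / (5 + 1) ~ 0.917
```

The vector of such conditional probabilities (or the resulting code lengths) over many contexts is what serves as the text representation for the categorization experiments.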


Mathematical Foundations of Natural Language Modeling

2000-2001 Program: Mathematics in Multimedia