Talk
Abstracts:
Material
from Talks
Jerome R. Bellegarda (Spoken Language Group, Apple
Computer), jerome@apple.com
Data-Driven Semantic Language Modeling
Statistical language models used in large vocabulary speech
recognition have to properly encapsulate the various constraints,
both local and global, present in the language. While usual
n-gram modeling can readily capture local constraints, it has
been more difficult to handle global constraints, like long-term
semantic dependencies, within an adequate data-driven formalism.
This presentation focuses on the use of latent semantic analysis
(LSA), a paradigm which automatically uncovers the salient semantic
relationships between words and documents in a given corpus.
In this approach, (discrete) words and documents are mapped
onto a single (continuous) semantic vector space of comparatively
low dimension, in which familiar clustering techniques can be
applied. Applications include information retrieval, automatic
semantic classification, and semantic language modeling. In
the latter case, it leads to several model families with various
smoothing properties. Because of their large-span nature, such
LSA language models are well suited to complement conventional
n-grams. An integrative formulation is proposed for harnessing
this synergy, in which the latent semantic information is used
to suitably adjust the standard n-gram probability. This hybrid
modeling compares favorably with the corresponding n-gram baseline:
experiments conducted on the Wall Street Journal domain show
a reduction in average word error rate of over 20%. The presentation
will conclude with a discussion of intrinsic trade-offs and
open issues in this emerging area of statistical language modeling.

Julian
Besag (Department of Statistics, University of Washington)
julian@stat.washington.edu
Markov
Chain Monte Carlo and Bayesian Computation
Markov
chain Monte Carlo (MCMC) refers to a collection of methods for
closely approximating integrals with respect to awkward, often
very high-dimensional, probability distributions. The basic
idea is to design and run a Markov chain whose limiting distribution
is the distribution of interest and to estimate the required
integrals via the corresponding sample averages. The original
Metropolis (1953) construction, generalized by Hastings (1970),
was used extensively in the analysis of interacting particle
systems such as Ising and Potts models. Subsequent developments
included cluster and multigrid algorithms in the 1980's and
simulated tempering and coupling from the past in the 1990's.
The 1980's also saw the introduction of MCMC for simulating
hidden Markov random fields (spatial generalizations of hidden
Markov chains) in image analysis and this inspired other statisticians,
particularly those working in the Bayesian paradigm, where general
implementation had been frustrated by the need to evaluate high-dimensional
integrals. Indeed, MCMC has now become the standard computational
engine for Bayesian analysis. Although Hastings constructions,
especially the Gibbs sampler (aka the heat bath algorithm),
still predominate, there have also been some useful advances
in the design of algorithms, such as reversible jumps. The talk
provides an introduction to MCMC for Bayesian computation. Some
notes (65pp), whose coverage is not restricted to Bayesian inference,
are available at
http://www.csss.washington.edu/Papers/
working paper no.9
These
notes are not comprehensive, though they include some more advanced
topics and at least provide useful further references! See also
the MCMC website for several hundreds of papers: http://www.statslab.cam.ac.uk/~mcmc/
Olivier Catoni (Université Paris 6) catoni@ccr.jussieu.fr
Adaptive N-Grams through the Gibbs Aggregation Rule
We
will present some general pseudo-Bayesian rule to aggregate estimators
and show that it satisfies a non-asymptotic oracle inequality.
We will explain how to apply it to adaptive histogram estimation
and show the connection with the estimation of "adaptive" n-gram
frequencies for text analysis. Most of the material we plan to
explain can be found in written form in the following communication
:
O. Catoni Data compression and adaptive histograms, International
Conference on Foundations of Computational Mathematics in
honor of Professor Steve Smale's 70th Birthday July 13-17
2000
It can be downloaded, as well as a demonstration software,
from the website
http://www.proba.jussieu.fr/users/catoni/gibbsHist_doc/index.html
or from my homepage
http://www.proba.jussieu.fr/users/catoni/homepage/homepage-en.html
Michael
Collins (At&T Labs-Research) mcollins@research.att.com
Statistical
Models for Natural Language Parsing
This
talk will discuss the problem of machine learning applied
to natural language parsing: given a set of example sentence/tree
pairs, the task is to learn a function from sentences to trees
which generalizes well to new sentences.
In
the first part of the talk I will review recent work on probabilistic,
history-based approaches. Much of the recent success of these
methods has been due to the incorporation of lexically conditioned
parameters. I will discuss the importance of head words and
dependency parameters, and also the use of estimation methods
such as decision trees or maximum entropy methods.
While
history-based models have several advantages, it can be awkward
to encode some constraints within this framework. It is often
easy to think of features which might be useful in discriminating
between candidate trees for a sentence, but much more difficult
to alter the model to take these features into account. In
the second part of the talk I will review more recent work
on learning methods which promise to be considerably more
flexible in incorporating features. I will discuss how three
such approaches -- boosting, support vector machines and markov
random fields -- can be applied to parsing, and the similarities
and relative advantages of the three approaches.
Frederick
Jelinek (Department of Electrical and Computer
Engineering, Johns Hopkins University) jelinek@jhu.edu
Parser
- Driven Language Modeling
he
talk will describe two current approaches to parser - driven
language modeling. A bottom - up approach due to Chelba and
Jelinek (the Structured Language Model - SLM) and a top -
down approach due to Brian Roark.
Both
methods have been operated with the help of a stack from which
less probable sub-parse entries are purged before further
words are generated. However, for the SLM it is possible to
generalize the CKY chart parsing algorithm to obtain a chart
which allows the direct computation of language model probabilities
thus rendering the stacks unnecessary.
An
analysis of the behavior of the SLM leads to a generalization
of the Inside - Outside algorithm and thus to rigorous EM
type re-estimation of the SLM parameters. The chart parsing
algorithms seem computationally expensive but their demands
can be mitigated by use of appropriate thresholding.
Mark
Johnson (Department of Cognitive and Linquistic Sciences,
Brown University) mj@lx.cog.brown.edu
An
Introduction to Probabilistic Grammars and their Applications
pdf
postscript
This
talk surveys the roles of grammars in computational linguistics,
and gives a high-level description of HMMs, PCFGs and stochastic
unification grammars. We also describe the estimation problems
from both visible and hidden data.
Sanjeev
Khudanpur (Johns Hopkins University )
Maximum
Entropy Techniques and Exponential Models in SLM/SCL
Maximum
Entropy methods provide elegant means for estimating complex
probabilistic models from sparse data. They have been applied
to various problems encountered in the processing of natural
language including statistical language modeling, part-of-speech
tagging, parsing, text segmentation and classification and machine
translation. This presentation will review of some of these
applications. It will be argued that the main issues in using
maximum entropy techniques are (i) the selection of features
or linear constraints which the model should be required to
satisfy and (ii) the enormous computation involved in obtaining
model parameters given these constraints. Recent proposals in
the literature for addressing these two problems will be mentioned
and some open problems will be brought up for discussion.

Christopher Manning (Stanford University) manning@cs.stanford.edu
Probabilistic
Models in Computational Linguistics
This talk will attempt to cover (very briefly!): what the aims
and goals of computational linguistics are; why there is a need
for statistical and quantitative techniques in linguistics and
computational linguistics, including connections between psycholinguistic
work and computational models; the distinctive features of natural
language from the perspective of statistical models; what the
major practical problems in natural language understanding and
generation are; an overview of the methods that have been used
to tackle those problems; and a few comments on some of the
things that we don't have much idea how to do.

Mehryar
Mohri (At&T Labs-Research) mohri@research.att.com
Finite-State Language Modeling
Weighted
automata and transducers are successfully used in many text
and speech processing applications. They give a unifying framework
for the construction and representation of the components of
speech recognition, speech synthesis and spoken-dialogue systems.
This representation provides opportunities for the application
of general and well-studied algorithms. We give a brief introduction
to the theory of weighted transducers and describe some of these
algorithms.
The
automata used in language modeling for speech synthesis, speech
understanding and speech recognition represent regular languages.
Recent work in the theory of weighted automata shows that the
same weighted automata as those currently used in speech processing
can be used to recognize efficiently non-trivial classes of
context-free languages. We present some results of this work,
and illustrate the efficient recognition of some well-known
context-free languages.
Roni
Rosenfeld (School of Computer Science, Cernegie Mellon
University) Roni_Rosenfeld@alf11.speech.cs.cmu.edu
An
Accelerated Introduction to Statistical Language Modeling
Statistical Language Models estimate the distribution of various
natural language phenomena for the purpose of speech recognition
and other language technologies. Since the first significant
model was proposed in 1980, many attempts have been made to
improve the state of the art. I will review those, point to
a few promising directions, and argue for a Bayesian approach
to integration of linguistic theories with data.
The
talk will be loosely based on the survey paper:
"Two
decades of Statistical Language Modeling: Where Do We Go From
Here?", Roni Rosenfeld, Proceedings of the IEEE 88(8), August
2000 pdf
(98KB) postscript
(113KB).
that
was mailed to the workshop participants ahead of the workshop.
However, the talk will be considerably more technical, and will
be specifically designed to rapidly introduce statisticians
and mathematicians to the field.
Jean-Philippe
Vert (Ecole Normale Superieure of Paris) http://www.dma.ens.fr/users/vert/
Jean-Philippe.Vert@ens.fr
Adaptive
Context Tree and Text Categorization
In
this talk I present an adaptive algorithm for estimating the
conditional probability of a letter knowing the past and give
an "oracle" inequality for its conditional Kullback-Leibler
risk. I argue that such adaptive estimators can be used to represent
texts and show experimental results of unsupervised and supervised
text categorization based on this representation.
Material
from Talks
Mathematical Foundations of Natural Language Modeling
2000-2001
Program: Mathematics in Multimedia
|