Talk Abstract:
Speech Signal Representations in Speech Synthesis: Current trends and parallels in ASR
Michael W. Macon
Oregon Graduate Institute
Center for Spoken Language Understanding
macon@ece.ogi.edu
http://cslu.cse.ogi.edu
In the earliest days of speech processing research, scientists
searched for the small set of acoustic cues that were most vital
to speech intelligibility. Based on this knowledge, they constructed
ad hoc systems of rules to recognize speech. Under ideal conditions,
these systems worked. However, the recognition performance was
not robust because no one understood how to model the sources
and manifestations of VARIABILITY in speech. In the 1980s,
dramatic changes occurred because data storage and CPU cycles
became cheap (now essentially free). Large statistical models
with many parameters to train replaced the ad hoc systems of
rules. This allowed the arduous task of data collection to replace
ad hoc hand-tuning, but the resulting systems were still quite
brittle in the face of variability. In response to this, researchers
have since invented methods for modeling and understanding certain
systematic components of variability, rather than simply throwing
more parameters into models: e.g., speaker adaptation, noise-
and channel-robust features, and speaking-rate-dependent pronunciation
modeling.
It appears that this same evolution -- from (1) development
of ad hoc rules, to (2) collection of large catalogues of examples,
to (3) efforts to model sources of variability -- is happening
in speech synthesis research, but delayed by a number of years.
Stage 1 was quite successfully implemented in the 1970s and 1980s
in the development of formant synthesis rules. Since the mid-1990s,
Stage 2 has become the dominant paradigm, embodied in large
database concatenative techniques for "unit selection." Unit
selection produces beautiful but brittle synthesis -- the quality
can vary from fantastic to abysmal, depending on the characteristics
of the input text.
In our lab at OGI, we are hoping to push the synthesis field
into the third stage of evolution -- understanding and modeling
various types of systematic variability in speech, and using
these to improve quality. This talk will describe a variety
of techniques that attempt to achieve this goal, and (hopefully)
spawn discussions of what the synthesis, recognition, and mathematics
communities can learn from each other at this stage of the evolutionary
process.
BIOGRAPHY
Michael Macon received the Ph.D. and M.S.E.E. degrees in
Electrical Engineering from the Georgia Institute of Technology
in 1996 and 1993, respectively, and the B.E.E. degree from the
University of Dayton in 1991. His current research interests
include various topics in speech synthesis. He is an Assistant
Professor in the Department of Electrical and Computer Engineering
and the Center for Spoken Language Understanding at the Oregon
Graduate Institute. Dr. Macon was awarded a National Science
Foundation Faculty Early Career Development (CAREER) award in
1999. He is an Associate Editor of the IEEE Transactions on
Speech and Audio Processing, and serves as a reviewer for several
speech processing journals.
Mathematical Foundations of Speech Processing and Recognition
2000-2001
Program: Mathematics in Multimedia