Speech Signal Representations in Speech Synthesis: Current trends and parallels in ASR

Wednesday, September 20, 2000 - 9:30am - 10:25am
Keller 3-180
Michael Macon (Oregon Graduate Institute of Science & Technology)
In the earliest days of speech processing research, scientists searched for the small set of acoustic cues that were most vital to speech intelligibility. Based on this knowledge, they constructed ad hoc systems of rules to recognize speech. Under ideal conditions, these systems worked. However, the recognition performance was not robust because no one understood how to model the sources and manifestations of VARIABILITY in speech. In the 1980's, dramatic changes occurred because data storage and CPU cycles became cheap (now essentially free). Large statistical models with many parameters to train replaced the ad hoc systems of rules. This allowed the arduous task of data collection to replace ad hoc hand-tuning, but the resulting systems were still quite brittle in the face of variability. In response to this, researchers have since invented methods for modeling and understanding certain systematic components of variability, rather than simply throwing more parameters into models: e.g., speaker adaptation, noise- and channel-robust features, speaking-rate dependent pronunciation modeling, etc.

It appears that this same evolution -- from (1) development of ad hoc rules, to (2) collection of large catalogues of examples, to (3) efforts to model sources of variability -- is happening in speech synthesis research, but delayed by a number of years. Stage 1 was quite successfully implemented in the 70's and 80's in the development of formant synthesis rules. Since the mid-90's, Stage 2 has become the dominant paradigm, embodied in large database concatenative techniques for unit selection. Unit selection produces beautiful but brittle synthesis -- the quality can vary from fantastic to abysmal, depending on the characteristics of the input text.
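The unit-selection approach described above is commonly formulated as a dynamic-programming search over a database of candidate units, minimizing a sum of target costs (how well a unit matches the specification derived from the text) and concatenation costs (how smoothly adjacent units join). A minimal sketch of that search follows; the cost functions and the numeric "units" standing in for real acoustic features are invented for illustration, not taken from any particular system:

```python
def select_units(targets, inventory, target_cost, join_cost):
    """Pick one candidate unit per target position so that the
    summed target cost + concatenation (join) cost is minimized,
    via a Viterbi-style dynamic-programming search."""
    # layers[i][u] = (cost of cheapest path ending in unit u, backpointer)
    layers = [{u: (target_cost(targets[0], u), None) for u in inventory[0]}]
    for i in range(1, len(targets)):
        cur = {}
        for u in inventory[i]:
            tc = target_cost(targets[i], u)
            # best predecessor for unit u in the previous layer
            pred, cost = min(
                ((p, c + join_cost(p, u) + tc)
                 for p, (c, _) in layers[-1].items()),
                key=lambda pc: pc[1],
            )
            cur[u] = (cost, pred)
        layers.append(cur)
    # trace back from the cheapest final unit
    u = min(layers[-1], key=lambda x: layers[-1][x][0])
    path = [u]
    for i in range(len(layers) - 1, 0, -1):
        u = layers[i][u][1]
        path.append(u)
    path.reverse()
    return path


# Toy example: "units" are pitch values; target cost penalizes distance
# from the desired pitch, join cost penalizes pitch jumps at boundaries.
chosen = select_units(
    targets=[100, 200],
    inventory=[[90, 120], [150, 210]],
    target_cost=lambda t, u: abs(t - u),
    join_cost=lambda a, b: 0.1 * abs(a - b),
)
# chosen == [90, 210]
```

The brittleness noted above shows up directly in this formulation: when the database happens to contain well-matching, smoothly-joining candidates, the total cost is low and output quality is high; when it does not, the search still returns the cheapest path available, however audible its seams.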

In our lab at OGI, we are hoping to push the synthesis field into the third stage of evolution -- understanding and modeling various types of systematic variability in speech, and using these to improve quality. This talk will describe a variety of techniques that attempt to achieve this goal, and (hopefully) spawn discussions of what the synthesis, recognition, and mathematics communities can learn from each other at this stage of the evolutionary process.


Michael Macon received the Ph.D. and M.S.E.E. degrees in Electrical Engineering from the Georgia Institute of Technology in 1996 and 1993, respectively, and the B.E.E. degree from the University of Dayton in 1991. His current research interests include various topics in speech synthesis. He is an Assistant Professor in the Department of Electrical and Computer Engineering and the Center for Spoken Language Understanding at the Oregon Graduate Institute. Dr. Macon was awarded a National Science Foundation Faculty Early Career Development (CAREER) award in 1999. He is an Associate Editor of the IEEE Transactions on Speech and Audio Processing, and serves as a reviewer for several speech processing journals.