Modelling Graph-based Observation Spaces for Segment-Based Speech Recognition

Tuesday, September 19, 2000 - 2:00pm - 2:55pm
Keller 3-180
James Glass (Massachusetts Institute of Technology)
In most current speech recognizers, the observation space of an utterance consists of a temporal sequence of frames. An important property of this framework is that every segmentation of the input utterance accounts for all of the observations. In contrast, in a segment-based, feature-based framework (where the segmentation may be implicit or explicit), each segment is represented by a fixed-dimensional feature vector, so alternative segmentations of the utterance produce different feature-vector sequences. The total observation space over all possible segmentations can be represented as a temporal graph of feature vectors.
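The segment graph described above can be sketched in a few lines of code. This is only an illustrative model, not an implementation from the talk: the boundary times, feature dimensions, and the `Segment` type are assumptions chosen to show how alternative segmentations of one utterance traverse different feature-vector sequences through the same graph.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    # Hypothetical representation: a segment spans [start, end) in frame
    # indices and carries one fixed-dimensional feature vector.
    start: int
    end: int
    features: tuple

# Nodes of the temporal graph are boundary times; edges are candidate
# segments, each with its own feature vector (values here are made up).
segments = [
    Segment(0, 10, (0.2, 1.1)),
    Segment(10, 25, (0.7, -0.3)),
    Segment(0, 25, (0.5, 0.4)),   # alternative: one longer segment
    Segment(25, 40, (-0.1, 0.9)),
]

def segmentations(segs, start, final):
    """Enumerate all paths of contiguous segments from `start` to `final`."""
    if start == final:
        yield []
        return
    for s in segs:
        if s.start == start:
            for rest in segmentations(segs, s.end, final):
                yield [s] + rest

paths = list(segmentations(segments, 0, 40))
# Two alternative segmentations of the same utterance:
#   [0-10, 10-25, 25-40]  and  [0-25, 25-40]
# Each path observes a different feature-vector sequence.
```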

In our work on segment-based speech recognition, we have explored probabilistic frameworks that allow us to compare different segmentations by considering the entire observation space of features. The first approach adds an extra lexical unit defined to map to all segments that do not correspond to one of the existing units. In our phonetic modelling, we call this unit the anti-phone, and use it to model all feature vectors that are not hypothesized to be a phonetic unit. Two competing segmentations must therefore account for all segments, either as a normal acoustic-phonetic unit or as the anti-phone. An extension of the anti-phone concept partitions the observation space into near-miss subsets, whereby each segment in a hypothesized segmentation is associated with a near-miss subset of segments that are not in the segmentation. For the framework to be probabilistically valid, the near-miss subsets for every segmentation must be mutually exclusive and exhaustive.
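The anti-phone idea can be illustrated with a toy scoring function. Everything below is a hedged sketch, not the recognizer's actual models: the segment names, the log-likelihood values, and the unit inventory are invented for illustration. The point it demonstrates is the one made above: every hypothesis scores every segment in the graph, using its phone model for hypothesized segments and the anti-phone for the rest, so competing segmentations account for the same total observation space and are directly comparable.

```python
def log_likelihood(segment, unit):
    # Stand-in for a real acoustic model log p(features | unit);
    # these numbers are arbitrary illustrative values.
    fake = {("a", "ph1"): -1.0, ("a", "anti"): -4.0,
            ("b", "ph2"): -1.5, ("b", "anti"): -3.0,
            ("c", "ph3"): -2.0, ("c", "anti"): -2.5}
    return fake[(segment, unit)]

# All candidate segments in the graph; "c" spans the same time as "a"+"b".
ALL_SEGMENTS = ["a", "b", "c"]

def score(hypothesis):
    """Score a segmentation hypothesis (a map segment -> phonetic unit).

    Segments in the hypothesis use their phone model; every remaining
    segment in the graph is assigned to the anti-phone, so each
    hypothesis accounts for the entire observation space.
    """
    total = 0.0
    for seg in ALL_SEGMENTS:
        unit = hypothesis.get(seg, "anti")
        total += log_likelihood(seg, unit)
    return total

# Two competing segmentations over the same graph:
h1 = {"a": "ph1", "b": "ph2"}   # segment c falls to the anti-phone
h2 = {"c": "ph3"}               # segments a, b fall to the anti-phone
```

Because both scores sum over all three segments, the comparison `score(h1) > score(h2)` is well defined even though the two hypotheses use different segments. The near-miss extension refines this by replacing the single anti-phone with per-segment near-miss subsets, which (as noted above) must partition the non-hypothesized segments exhaustively and without overlap.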

In this talk, I will describe the probabilistic frameworks using the anti-phone and near-miss modelling techniques, and show how they have been employed in a segment-based recognizer to achieve state-of-the-art results on a common phonetic recognition task.