ESSV Konferenz Elektronische Sprachsignalverarbeitung

Title: Multimodal speech segmentation using gaze data and spectrogram image features

Authors: Arif Khan, Ingmar Steiner


Nearly all automatic speech segmentation approaches rely solely on acoustic features, which differs from the way humans segment speech using phonetic annotation software. In order to get closer to human-level precision in speech segmentation, we adopt a multimodal approach to improve the segmentation accuracy. To this end, we analyze a database of segmentation behavior collected using an eye tracker, obtained from human experts performing a manual segmentation task. This allows us to introduce gaze as an additional modality for automatic segmentation by transforming it into features for image-based phoneme segmentation (ISeg). Experiments were conducted for automatic speech segmentation, comparing the image-only ISeg technique, as well as ISeg combined with hidden Markov model (HMM) based acoustic segmentation, with the respective segmentation approaches conditioned on the gaze data. The results show that enhancing the image-based segmentation with gaze information improves the accuracy of ISeg, as well as of ISeg combined with HMMs.

Year: 2019
In session: Posters and Demonstrations
Pages: 197 to 204