Differences between speakers in audio- vs. visual classification of word prominence

Martin Heckmann

Abstract:

We show how the performance of an SVM classifier in discriminating prominent from non-prominent words, based on audio-visual features, varies from speaker to speaker. We collected data in a Wizard-of-Oz experiment in which users interacted via speech with a computer in a small game. Whenever the system misunderstood a single word, users were instructed to correct this word using prosodic cues only. We thus obtain a dataset that contains the same word with normal and with high prominence. Overall, we recorded 8 speakers. The analysis shows large variation from speaker to speaker with respect to which features can successfully be used to discriminate prominent from non-prominent words, depending on the prominence signaling strategy applied by the speaker. In particular, for speakers who mainly use duration to signal prominence, we see an increase in performance from combining acoustic and visual information. The audio-visual classification accuracies we obtain range from 66% for the most difficult speaker to 91% for the easiest speaker.

Year: 2013
In session: Werkzeuge und Messverfahren
Pages: 166 to 172