Can We See Your Response Before You Speak? Exploring Linguistic Information Found in Inter-Turn Pauses

Abstract:

In this work, we assess whether pauses between utterances, whether of the same or of different speakers, carry information that is predictive of the following speaker's utterance. We present models that connect a person's visual features before they speak to their upcoming utterance. In our experiments, we find that out-of-the-box pre-trained models already reach better-than-chance performance in correlating video embeddings with utterance embeddings. In contrast, models that attempt to predict the first word after the pause do not outperform a unigram model, indicating that our models do not read lips (based, e.g., on co-articulation effects) but rather capture more fundamental aspects of the upcoming utterance.
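To make the correlation setup concrete, the sketch below shows one way such an experiment could be framed: embed the silent pre-utterance video clip and a set of candidate utterances with pre-trained encoders, then retrieve the candidate whose embedding is most similar. This is a minimal illustration of the retrieval framing under assumed, pre-computed embeddings; it is not the authors' exact pipeline, and the toy data at the end is purely for demonstration.

```python
import numpy as np

def retrieve_utterance(video_emb: np.ndarray,
                       candidate_embs: np.ndarray,
                       candidates: list[str]) -> str:
    """Return the candidate utterance whose (pre-computed) embedding has the
    highest cosine similarity with the embedding of the pre-speech video clip.

    video_emb:      (d,) embedding of the speaker's video before they talk
    candidate_embs: (n, d) embeddings of n candidate utterances
    candidates:     the n candidate utterance strings
    """
    v = video_emb / np.linalg.norm(video_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return candidates[int(np.argmax(c @ v))]

# Toy usage: with random embeddings, retrieval accuracy should sit at chance
# level (1 / number of candidates); better-than-chance accuracy over many
# trials is what would indicate that the pre-speech video carries information
# about the upcoming utterance.
rng = np.random.default_rng(0)
cands = ["yes, I agree", "no, not really", "could you repeat that?"]
video_emb = rng.normal(size=128)
cand_embs = rng.normal(size=(len(cands), 128))
print(retrieve_utterance(video_emb, cand_embs, cands))
```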


Year: 2024
In session: Large Language Models
Pages: 165–172