From Writing to Speaking: On the Limits of Text-Trained Authorship Models for Speech Transcripts

Abstract:

Text-based speaker verification from speech transcripts is challenging because spoken language contains disfluencies and interactional markers that are largely absent from written text. In this work, we evaluate whether neural authorship representations can capture speaker-specific linguistic style. We fine-tune LUAR-MUD, a RoBERTa-based authorship model, on transcripts from three speech genres: spontaneous dialogue (Switchboard Dialog Act, SwDA), prepared monologue (TED-LIUM), and read speech (LibriSpeech), and evaluate both in-domain and cross-domain speaker verification. Fine-tuning consistently improves performance over the pretrained baseline, reducing the Equal Error Rate (EER) on conversational speech from 19.4% to 9.1%, with measurable generalization across speech genres. Models trained on prepared and read speech also transfer to conversational data, though with higher error rates. Ablation experiments that remove filled pauses, discourse markers, and false starts yield only limited performance changes, suggesting that speaker discrimination relies on distributed, idiosyncratic stylistic patterns rather than on individual disfluency types. Speaker-level analyses further reveal substantial inter-speaker variability and weak correlations among conversational features, indicating that the neural embeddings encode interactionally grounded stylistic signatures that persist across speech production regimes.
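As a rough illustration of the reported metric, the sketch below computes EER from cosine similarities between paired utterance embeddings. This is not the paper's evaluation pipeline; the trial construction is synthetic and all names are hypothetical, assuming only that verification scores are cosine similarities over fixed-size embeddings.

    # Minimal sketch (not the authors' code): Equal Error Rate (EER) for
    # speaker verification from cosine similarity scores over embeddings.
    # Embeddings here are random toy data standing in for model outputs.
    import numpy as np
    from sklearn.metrics import roc_curve

    def cosine_scores(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
        """Cosine similarity for each paired row of two embedding matrices."""
        a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
        b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
        return np.sum(a * b, axis=1)

    def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
        """EER: the operating point where false accept rate equals false reject rate."""
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where the two rates cross
        return float((fpr[idx] + fnr[idx]) / 2.0)

    # Toy trial list: 1 = same speaker, 0 = different speakers (impostor).
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(1000, 512))
    labels = rng.integers(0, 2, size=1000)
    # Same-speaker pairs get a correlated partner embedding; impostors are independent.
    emb_b = np.where(labels[:, None] == 1,
                     emb_a + 0.5 * rng.normal(size=emb_a.shape),
                     rng.normal(size=emb_a.shape))
    print(f"EER: {equal_error_rate(labels, cosine_scores(emb_a, emb_b)):.1%}")

Lower EER indicates better speaker discrimination; the 19.4% to 9.1% improvement reported above corresponds to this metric on conversational trials.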


Year: 2026
Session: Speech Synthesis
Pages: 86–93