Self-Supervised Multi-Task Learning for Enhanced Prosody Prediction in German Articulatory Speech Synthesis
Authors: Zihao Huang, Tianyi Zhang, Peter Birkholz
Abstract:
This paper presents a systematic comparison of self-supervised pre-training strategies for prosody modelling. We evaluate three pretext tasks within a unified LSTM-based architecture. The pre-trained encoder is integrated into a multi-task prosody model that jointly predicts phoneme duration, fundamental frequency (f0), and voicing. Objective evaluation shows that all pre-training methods improve prosody prediction over the baseline, particularly for pitch. Subjective listening tests, however, reveal no significant differences in perceived naturalness, indicating that objective gains do not always translate into perceptual advantages. These findings demonstrate that self-supervised pre-training enhances prosody prediction, while perceptual benefits depend on which aspects of prosodic realization are improved.
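The multi-task setup described above can be illustrated with a minimal sketch: a shared encoder whose output feeds three separate prediction heads, one per prosodic target. This is not the paper's implementation; all names, dimensions, and the use of a dense layer in place of the LSTM encoder are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not taken from the paper).
D_IN, D_HID = 16, 32

# Shared encoder: a single dense layer standing in for the LSTM encoder.
W_enc = rng.standard_normal((D_IN, D_HID)) * 0.1

# Three task-specific heads reading the shared representation:
W_dur = rng.standard_normal((D_HID, 1)) * 0.1  # phoneme duration (regression)
W_f0 = rng.standard_normal((D_HID, 1)) * 0.1   # fundamental frequency (regression)
W_voi = rng.standard_normal((D_HID, 1)) * 0.1  # voicing (binary classification)

def forward(x):
    """One shared encoding feeds all three prosody heads jointly."""
    h = np.tanh(x @ W_enc)                         # shared representation
    dur = h @ W_dur                                # predicted duration
    f0 = h @ W_f0                                  # predicted f0
    voicing = 1.0 / (1.0 + np.exp(-(h @ W_voi)))   # voicing probability in (0, 1)
    return dur, f0, voicing

x = rng.standard_normal((5, D_IN))                 # 5 input frames
dur, f0, voi = forward(x)
print(dur.shape, f0.shape, voi.shape)              # (5, 1) (5, 1) (5, 1)
```

In such a setup, pre-training would initialise the shared encoder weights via a pretext task before the three heads are trained jointly on the supervised prosody targets.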


