TensorTract3: Pushing the Limits of Articulatory Speech Synthesis

Abstract:

Articulatory speech synthesis offers interpretability and physiologically grounded control over speech production, but achieving high intelligibility remains difficult. Accurate synthesis requires precise, time-varying motor control of a high-dimensional articulatory system, comparable to the demanding sensorimotor skill underlying human speech. Speech inversion—inferring motor control from acoustics—has been proposed as a solution, but high-intelligibility results have remained elusive. In this work, we present TensorTract3 (TT3), a decoder-guided speech inversion system for the articulatory synthesizer VocalTractLab (VTL) that predicts articulatory trajectories from WavLM features. Rather than learning inversion purely in the articulatory domain, TT3 is trained through a pretrained articulatory-to-acoustic decoder that defines an acoustic loss in WavLM feature space, enabling both supervised training on synthetic paired trajectories and training directly on natural speech without articulatory labels. On German and English benchmarks, increasing the acoustic-loss weight reduces the VTL synthesis character error rate (CER) from ∼83% to 4.9%; training on natural speech further reduces CER to ≈3–4% and yields neural synthesis from the inferred controls that approaches natural-speech intelligibility. These results indicate that articulatory reconstruction accuracy alone is a weak proxy for intelligibility and motivate decoder-guided, “vocal-learning” objectives for future speech inversion research.
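
As a rough illustration of the decoder-guided objective described above, the following PyTorch-style sketch shows how an acoustic loss in WavLM feature space might be combined with an optional supervised trajectory loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the module names (InversionModel, ArticulatoryDecoder), network architectures, dimensions, and loss weighting are hypothetical stand-ins.

    # Hedged sketch of a decoder-guided speech-inversion objective.
    # All names, shapes, and architectures below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class InversionModel(nn.Module):
        """Hypothetical inverter: WavLM features -> articulatory trajectories."""
        def __init__(self, feat_dim=768, artic_dim=30):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, 256, batch_first=True, bidirectional=True)
            self.head = nn.Linear(512, artic_dim)

        def forward(self, wavlm_feats):          # (B, T, feat_dim)
            h, _ = self.rnn(wavlm_feats)
            return self.head(h)                   # (B, T, artic_dim)

    class ArticulatoryDecoder(nn.Module):
        """Hypothetical pretrained decoder: trajectories -> WavLM-space features."""
        def __init__(self, artic_dim=30, feat_dim=768):
            super().__init__()
            self.rnn = nn.GRU(artic_dim, 256, batch_first=True, bidirectional=True)
            self.head = nn.Linear(512, feat_dim)

        def forward(self, trajectories):          # (B, T, artic_dim)
            h, _ = self.rnn(trajectories)
            return self.head(h)                   # (B, T, feat_dim)

    def training_loss(inverter, decoder, wavlm_feats, artic_targets=None,
                      acoustic_weight=1.0):
        """Acoustic loss in WavLM feature space, routed through the (frozen)
        decoder, plus a supervised trajectory loss when paired synthetic
        articulatory labels are available. Natural speech has no labels, so
        only the acoustic term applies there."""
        pred_traj = inverter(wavlm_feats)
        loss = acoustic_weight * nn.functional.mse_loss(decoder(pred_traj),
                                                        wavlm_feats)
        if artic_targets is not None:             # synthetic paired data
            loss = loss + nn.functional.mse_loss(pred_traj, artic_targets)
        return loss

In such a setup, the pretrained decoder would presumably be frozen (e.g., requires_grad_(False) on its parameters) so that gradients from the acoustic term shape only the inverter; the acoustic_weight factor plays the role of the acoustic-loss weight whose increase the abstract reports as reducing CER from ∼83% to 4.9%.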


Year: 2026
In session: Speech Synthesis
Pages: 112–119