Can Deep Learning Help to Understand Speech Production Mechanisms?

Abstract:

Deep Learning has become inescapable in all fields of research. It yields unprecedented levels of prediction accuracy but is often associated with a loss of understanding of the phenomenon under study. This study aims, on the contrary, to take advantage of the performance of Deep Learning to increase knowledge of speech production. More specifically, it explores the potential of Deep Learning as an original method for determining cross-speaker, vowel-specific articulatory invariants, i.e. the articulatory features that remain stable across speakers in the production of vowels. 228 midsagittal MRI images of 41 speakers articulating 6 vowels were considered, for which manually traced vocal tract contours are available and aligned in a common reference coordinate system. Convolutional Neural Networks were trained to classify the images by vowel in five increasingly challenging scenarios, from two to six classes, in a leave-one-speaker-out scheme, reaching accuracies above 99%. The Grad-CAM algorithm was then applied to all test images, producing heatmaps that identify the vocal tract regions determinant for a robust classification of each image. The edges of these regions were aligned in the reference coordinate system and averaged over all instances of a vowel within a scenario. The preliminary results show that a vowel can be robustly identified from the anterior part of the vocal tract, even when the constriction, crucial for the acoustics, is located in the posterior part. Our approach demonstrates the potential of Deep Learning as a tool to increase knowledge of speech production.
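The heatmap step of the pipeline rests on the standard Grad-CAM computation (Selvaraju et al.): the gradients of the class score with respect to the last convolutional feature maps are global-average-pooled into per-channel weights, the feature maps are combined with those weights, and a ReLU keeps only regions with a positive influence on the class. A minimal NumPy sketch of that core step, not the authors' implementation (the array shapes and normalization choice are illustrative assumptions):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM core step on precomputed activations and gradients.

    feature_maps: (C, H, W) activations of the last conv layer
    gradients:    (C, H, W) d(class score)/d(activations)
    returns:      (H, W) non-negative localization heatmap in [0, 1]
    """
    # alpha_c: global average pooling of the gradients per channel
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # weighted combination of the forward activation maps
    cam = np.tensordot(weights, feature_maps, axes=1)  # shape (H, W)
    # ReLU: keep only features with positive influence on the class
    cam = np.maximum(cam, 0.0)
    # normalize to [0, 1] for visualization (skip if the map is flat)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

In practice the resulting low-resolution map is upsampled to the input image size and thresholded to extract the edges of the determinant region, which can then be aligned and averaged across images as described in the abstract.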


Year: 2023
In session: Speech Synthesis and Production
Pages: 181 to 188