Convolutional neural networks can learn duration for detecting pitch accents and lexical stress

Abstract:

The duration of syllables or words is an important correlate of prosody and often used as a feature for automatic pitch accent detection. We have previously introduced a method for pitch accent detection using a convolutional neural network (CNN) that yields good performance using low-level acoustic descriptors alone, without any explicit duration information. In this paper, we use this model for various pitch accent and lexical stress detection tasks at the word and syllable level on the DIRNDL German radio news corpus. We show that information on word or syllable duration is encoded in the high-level CNN feature representation by training a linear regression model on these features to predict duration. The fact that this can be approximated suggests that the CNN makes use of implicit duration information that is derived from the frame-based input. We also observe that duration is only learnt in tasks where it is directly correlated with the target label. We compare two different methods of pooling that capture the input information differently and show how this affects what is encoded in the output representation.
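The probing idea described above, fitting a linear regression on a learned feature representation to predict duration, can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the features are synthetic stand-ins for CNN outputs, the helper names (`fit_linear_probe`, `predict`, `r_squared`, `probe_score`) are hypothetical, and the use of R² on a held-out split as the score is an assumption.

```python
import numpy as np


def fit_linear_probe(features, durations):
    """Fit ordinary least squares (with a bias term) from features to durations."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, durations, rcond=None)
    return weights


def predict(features, weights):
    """Apply the fitted linear probe to a feature matrix."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    return design @ weights


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 means perfect prediction, about 0 means none."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n_train, n_test, n_dims = 400, 100, 16
n_total = n_train + n_test

# Synthetic durations in seconds (assumed range, purely illustrative).
durations = rng.uniform(0.05, 0.6, size=n_total)

# Hypothetical "CNN features" that encode duration linearly, plus noise.
mixing = rng.normal(size=n_dims)
informative = np.outer(durations, mixing) + 0.05 * rng.normal(size=(n_total, n_dims))

# Hypothetical features that carry no duration information at all.
uninformative = rng.normal(size=(n_total, n_dims))


def probe_score(features):
    """Train the probe on the first n_train items, score R^2 on the rest."""
    weights = fit_linear_probe(features[:n_train], durations[:n_train])
    return r_squared(durations[n_train:], predict(features[n_train:], weights))


r2_informative = probe_score(informative)
r2_uninformative = probe_score(uninformative)
print(f"R^2 with duration-bearing features: {r2_informative:.3f}")
print(f"R^2 with duration-free features:    {r2_uninformative:.3f}")
```

In this sketch, a high held-out score indicates that duration is linearly recoverable from the representation, while a score near zero indicates that it is not. This mirrors the contrast the abstract draws between tasks in which duration is learnt and tasks in which it is not.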


Year: 2019
In session: Spracherkennung und -wahrnehmung (Speech Recognition and Perception)
Pages: 17 to 24