Speech Fundamental Period Estimation using a Neural Network

Authors: Ian S. Howard


Here we extend previous work on estimating the time of excitation (Tx) from the speech signal using a shallow neural network. We use a dataset of simultaneously recorded speech and Laryngograph signals from drama students speaking a phonetically balanced passage. We first use the Laryngograph signal to estimate the location of vocal fold closures as a function of time. Then, treating the problem as a supervised learning task, we train a multilayer perceptron to map from raw speech samples, selected using a sliding input window, to a single output target sample that represents the presence or absence of an excitation point. We present results across several male speakers and also demonstrate that it is possible to reconstruct the Laryngograph signal directly from the speech signal.

Year: 2020
In session: Speech Synthesis
Pages: 44 to 51
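
The supervised setup described in the abstract (a sliding window of raw speech samples mapped to a single excitation target) might be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the function names `make_training_pairs` and `mlp_forward`, the window length, the alignment of the target with the window centre, the tanh hidden layer, and the synthetic signal are all hypothetical choices that the abstract does not specify.

```python
import math
import random


def make_training_pairs(speech, closure_indices, window=5):
    """Pair each sliding window of speech samples with a 0/1 target.

    Assumption: the window length is odd and the target is 1 when the
    window's centre sample coincides with a marked vocal fold closure.
    """
    half = window // 2
    closures = set(closure_indices)
    inputs, targets = [], []
    for centre in range(half, len(speech) - half):
        inputs.append(speech[centre - half:centre + half + 1])
        targets.append(1.0 if centre in closures else 0.0)
    return inputs, targets


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One tanh hidden layer followed by a single sigmoid output unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


# Toy data: a sinusoid stands in for speech, and evenly spaced indices
# stand in for closure points derived from the Laryngograph signal.
speech = [math.sin(2 * math.pi * n / 20) for n in range(100)]
closure_indices = list(range(10, 100, 20))
inputs, targets = make_training_pairs(speech, closure_indices, window=5)

# Untrained network with small random weights (hypothetical sizes).
rng = random.Random(0)
n_hidden = 4
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b_out = 0.0

# Output in (0, 1), read as the estimated presence of an excitation point.
prob = mlp_forward(inputs[0], w_hidden, b_hidden, w_out, b_out)
```

In practice the weights would be fitted to the Laryngograph-derived targets by gradient descent, and the window would span many more samples than the five used here.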