On the Use of Fujisaki Parameters for the Quality Prediction of Synthetic Speech


This paper presents research on the use of Fujisaki parameters for the quality prediction of synthetic speech. The Fujisaki model describes the pitch contour of a speech signal through three kinds of parameters: the base frequency, phrase commands, and accent commands. The base frequency represents the minimum F0 value in the signal, the phrase commands describe the slowly varying components, and the accent commands indicate local peaks in the contour. The Fujisaki parameters were extracted for four independent, auditorily evaluated databases consisting of synthetic speech generated by over 20 different text-to-speech (TTS) systems. The prosody generation techniques of these systems are unknown to us, so the systems may or may not base their prosody on a Fujisaki-like model. The extracted parameters were used to calculate 47 features (e.g. mean distance between phrase commands, variance of accent command amplitudes). A stepwise multiple linear regression of these features, with the overall quality judgement (MOS) as the response variable, yielded one quality prediction model per gender. A leave-one-out cross-validation showed the stability of both models. The Pearson correlation R between predicted MOS and auditory MOS was computed per gender and database. The mean correlation reached a value of R > .50. Even though the computed Fujisaki features do not fully capture the auditory quality of TTS stimuli, both models will be helpful for predicting TTS quality. In combination with other features in particular, an increase in accuracy is to be expected.
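The evaluation procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`example_features`, `loo_predictions`, `pearson_r`) and the input layout are hypothetical, only two of the 47 features are shown, the variance is assumed to be the population variance, and a plain least-squares fit stands in for the stepwise feature selection, which is not reproduced here.

```python
import numpy as np


def example_features(phrase_times, accent_amplitudes):
    """Compute two of the features named in the abstract.

    Hypothetical inputs: onset times (s) of at least two phrase commands
    and the amplitudes of the accent commands of one stimulus.
    """
    phrase_times = np.sort(np.asarray(phrase_times, dtype=float))
    accent_amplitudes = np.asarray(accent_amplitudes, dtype=float)
    return {
        # Mean distance between consecutive phrase commands.
        "mean_phrase_distance": float(np.mean(np.diff(phrase_times))),
        # Population variance (ddof=0) is an assumption of this sketch.
        "var_accent_amplitude": float(np.var(accent_amplitudes)),
    }


def loo_predictions(X, y):
    """Leave-one-out predictions of a multiple linear regression.

    X: (n_stimuli, n_features) feature matrix, y: auditory MOS values.
    Each stimulus is predicted by a model fitted on all other stimuli.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # intercept column
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # hold out stimulus i
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        preds[i] = design[i] @ coef
    return preds


def pearson_r(predicted, observed):
    """Pearson correlation between predicted and auditory MOS."""
    return float(np.corrcoef(predicted, observed)[0, 1])
```

In a setup like the one described, `loo_predictions` would be run separately for each gender, and `pearson_r` would then be evaluated per gender and database on the held-out predictions.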

Year: 2012
In session: Sprach- und Signalverarbeitung
Pages: 112 to 119