On the use of automatic speech recognizers for the quality and intelligibility prediction of synthetic speech


In this paper, we investigate the use of an automatic speech recognizer (Google Speech API) for predicting the quality and intelligibility of synthetic speech. For two databases of rated synthetic speech samples, we analyze how the word error rate (WER) obtained from the recognizer for each sample correlates with ratings on 16 different attribute scales. Moderate correlations are observed for various quality aspects, including overall impression, naturalness, and intelligibility. Moreover, for a third database, we analyze the correlation between intelligibility for human listeners, as determined in a test with semantically unpredictable sentences, and the WER of the recognizer. The correlation between the humans' and the recognizer's WER is .40 over all individual samples and .94 when the WERs are averaged per TTS system.
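The two quantities the abstract relies on, WER and a correlation coefficient, can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: WER is taken as the word-level edit distance divided by the reference length, the correlation is Pearson's r, and the function names (`word_error_rate`, `pearson`, `mean_by_system`) are hypothetical. It does not call the recognizer and does not reproduce the paper's data or exact procedure.

```python
from collections import defaultdict
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient (assumed here; the abstract
    does not name the coefficient used)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def mean_by_system(systems, values) -> dict:
    """Average per-sample values for each TTS system label."""
    groups = defaultdict(list)
    for system, value in zip(systems, values):
        groups[system].append(value)
    return {system: sum(vals) / len(vals) for system, vals in groups.items()}
```

Under these assumptions, the per-sample figure would correspond to calling `pearson` on the raw per-sample WER lists, and the per-system figure to calling it on the values returned by `mean_by_system` for the human and recognizer WERs. Averaging per system smooths out sample-level noise, which is one plausible reason for the higher system-level correlation.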

Year: 2015
In session: Sprachsynthese
Pages: 105 to 111