Evaluating Commercial and Open Source Text-to-Speech Synthesis Considering Specific Problem Classes

Authors: Felix Burkhardt


Current state-of-the-art speech synthesizers for domain-independent systems still struggle with the challenge of generating understandable and natural-sounding speech. This is mainly because the pronunciation of words of foreign origin, inflections, and compound words often cannot be handled by rules, and there are too many of them to include in exception dictionaries. We describe an approach to evaluating text-to-speech synthesizers with a subjective listening experiment. The focus is on differentiating between known problem classes for speech synthesizers. We distinguish the following problem classes: abbreviations, acronyms, acronym-abbreviations, addresses, compounds, dialectal expressions, exclamations, words of foreign origin, German English (Denglisch), heterophonic homographs, named entities, inflected verbs, numbers, units and dates, and rare words. We also included some longer texts such as short news feeds or e-mails. Because data-based speech synthesizers as a rule perform very differently depending on how well the target text fits the data model, a large number of sentences must be tested in order to minimize the chance that test sentences were part of the synthesizer’s training data. Word lists for each of the above-mentioned categories were compiled and synthesized by a commercial and an open source synthesizer, both based on the non-uniform unit-selection approach. The synthesized speech was evaluated by a human judge using the Speechalyzer toolkit, and the results are discussed. They show that words of foreign origin in particular were pronounced badly by both systems.

Year: 2015
In session: Sprachsynthese
Pages: 120 to 127