ESSV Konferenz Elektronische Sprachsignalverarbeitung

Title: Dynamic vocabulary with a Kaldi speech recognizer in a speech dialog system for automotive infotainment applications

Authors: Thomas Ranzenberger, Christian Hacker, Karl Weilhammer


In this paper we present an evaluation of the Kaldi speech recognizer using dynamic vocabulary in an automotive context. We updated our previously integrated Kaldi speech recognizer and make use of a newly available decoding method together with a new special type of weighted finite state transducer that allows us to evaluate the usage of dynamic vocabulary. We use an existing Kaldi reference model for English and extend it to recognize names of a contact list, and we create a second model to recognize radio stations with a language model reduced to words for this specific domain. The contact list models are based on the LibriSpeech corpus with two hundred thousand words and are extended with forty, eighty and one hundred twenty words. We measured the time for modifying the reference model with the dynamic vocabulary. It took fourteen seconds for the biggest vocabulary model with an additional one hundred twenty words. We tested the word error rate of the models on the LibriSpeech corpus. The word error rates did not change significantly in comparison to our reference model. We extended the processing of the recognition result to detect the slots and match them with a list of slots. We evaluated the sentence error rate, slot detection error rate and intent error rate of the forty-word contact list model and the radio station model. Ten participants spoke 25 random sentences from a self-created corpus of example sentences. All participants were non-native speakers. The sentences contained words of the LibriSpeech corpus. Common names and stations of the United States were added which were not in the baseline LibriSpeech language model. For the radio station model we used an out-of-vocabulary placeholder in our sentences to test the intent mapping. The contact list model with forty words had a sentence error rate of 52.00%, a slot detection error rate of 34.80% and an intent detection error rate of 11.60%. The participants had problems with the pronunciation of the country- and region-specific names, which may also originate from outside the United States. The domain-specific radio station model had a sentence error rate of 20.40%, a slot detection error rate of 8.00% and an intent detection error rate of 1.60%. Most stations were spoken letter by letter. The high slot recognition rate and intent recognition rate of this model result from the reduced vocabulary for the specific domain.
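
To make the three reported metrics concrete, the following is a minimal Python sketch of how sentence, slot detection and intent error rates could be computed from per-utterance judgments. It is not the authors' evaluation code. The names `UtteranceResult`, `error_rate` and `summarize` are hypothetical, and the sketch assumes that each rate is the share of utterances judged incorrect for that criterion. The counts in the example are illustrative only: they assume 250 utterances (10 participants times 25 sentences) per model, an assumption the abstract does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class UtteranceResult:
    """Hypothetical per-utterance judgment (not taken from the paper)."""
    sentence_correct: bool  # recognized sentence equals the reference sentence
    slot_correct: bool      # detected slot matches an entry in the slot list
    intent_correct: bool    # mapped intent equals the expected intent


def error_rate(flags: Iterable[bool]) -> float:
    """Return the percentage of entries that are False (i.e. errors)."""
    flags = list(flags)
    if not flags:
        raise ValueError("cannot compute an error rate over zero utterances")
    errors = sum(1 for ok in flags if not ok)
    return 100.0 * errors / len(flags)


def summarize(results: List[UtteranceResult]) -> Dict[str, float]:
    """Compute all three error rates, in percent, over the same utterances."""
    return {
        "sentence_error_rate": error_rate(r.sentence_correct for r in results),
        "slot_error_rate": error_rate(r.slot_correct for r in results),
        "intent_error_rate": error_rate(r.intent_correct for r in results),
    }


# Illustrative counts only, assuming 250 utterances for one model.
TOTAL = 250
results = [
    UtteranceResult(
        sentence_correct=i >= 130,  # first 130 utterances have a sentence error
        slot_correct=i >= 87,       # first 87 utterances have a slot error
        intent_correct=i >= 29,     # first 29 utterances have an intent error
    )
    for i in range(TOTAL)
]

summary = summarize(results)
for name, value in summary.items():
    print(f"{name}: {value:.2f}%")
```

Under this definition an utterance can count as a sentence error while its slot and intent are still correct, which is one way the sentence error rate can be much higher than the intent error rate.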

Year: 2019
In session: Poster und Demonstrationen
Pages: 255 to 262