Cross-cultural recognition of auditive feedback with echo state networks

Abstract:

This paper describes the development of a classifier that distinguishes positive from negative user feedback in human-machine interaction. We compute prosodic features from the user’s utterances and feed them to an Echo State Network, a dynamic classifier that learns temporal dependencies implicitly. The data were recorded in a test scenario with German and Japanese subjects, once in natural speech and once in an artificial “language” consisting only of the syllable “na”. The subjects gave feedback to a simulation of the robot Flobi and were instructed to behave as if interacting with a child. The implemented Echo State Network proved able to learn to classify a single person’s feedback into the two categories “positive” and “negative” and generalized to a certain extent. We observed a wide range of feedback behaviors in the data, both intra-culturally and inter-culturally. Nevertheless, we can show that a classifier trained on German data performs significantly better on German data than on Japanese data, indicating that cultural differences exist. Analyzing different feature subsets, we found that using Mel-Frequency Cepstral Coefficients as features yields a better classification rate than using prosodic features (such as pitch and intensity) alone.
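The core idea of an Echo State Network is that a large, fixed random recurrent reservoir transforms an input sequence into a rich state trajectory, and only a linear readout is trained. The sketch below illustrates this on a toy binary sequence-classification task; all dimensions, hyperparameters, and the synthetic data are illustrative assumptions, not the paper's actual setup or features.

```python
# Minimal echo state network (ESN) sketch for binary sequence classification.
# Feature dimensions, reservoir size, and the toy data are hypothetical;
# the paper's actual prosodic/MFCC features and network are not shown here.
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_RES = 3, 100  # input features per frame, reservoir size (assumed)

# Fixed random input and reservoir weights; only the readout is trained.
W_in = rng.uniform(-0.5, 0.5, (N_RES, N_IN))
W = rng.uniform(-0.5, 0.5, (N_RES, N_RES))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))  # spectral radius < 1 ("echo state")

def reservoir_states(seq):
    """Run one feature sequence (T x N_IN) through the reservoir and
    return the final state as a fixed-length summary of the sequence."""
    x = np.zeros(N_RES)
    for u in seq:
        x = np.tanh(W_in @ u + W @ x)
    return x

# Toy data: class 0 sequences drift negative, class 1 sequences drift positive.
def make_seq(label):
    base = 0.8 if label else -0.8
    return base + 0.1 * rng.standard_normal((20, N_IN))

X = np.array([reservoir_states(make_seq(l)) for l in (0, 1) * 30])
y = np.array([0, 1] * 30)

# Train the linear readout by ridge regression on targets in {-1, +1}.
ridge = 1e-2
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(N_RES), X.T @ (2 * y - 1))

def classify(seq):
    """Return 1 for 'positive', 0 for 'negative' feedback (toy labels)."""
    return int(reservoir_states(seq) @ W_out > 0)
```

Because the reservoir weights stay fixed, training reduces to one linear least-squares fit, which is what makes ESNs attractive for small per-speaker datasets like the single-person experiments described above.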


Year: 2013
In session: Prosodic and multimodal expression in human-machine interaction
Pages: 173–180