How to Identify Speech when Translating Unpunctuated Poetry

Abstract:

A large proportion of (post-)modern poetry contains little or no punctuation. In this contribution, we investigate how well punctuation information can be recovered for (post-)modern poetry from the text and speech of free verse poems. We use the world's largest corpus of spoken (post-)modern poetry, provided by our partner lyrikline, which contains for each poem an audio recording spoken by the original author and features translations for many of the poems. We address two tasks: identifying lines that contain a phrase break in the middle of the poetic line, which may already be helpful for philological analysis, and identifying the position of the break within the line. As our target class, we select those poetic lines that contain one or more punctuation characters that typically indicate a phrase break in poetry (.,;:!?/) somewhere in the middle of the line rather than only at its end. We train a neural network, a bidirectional recurrent neural network (RNN) based on gated recurrent units (GRU) with attention, that combines audio and textual features to identify the punctuation, with the goal of applying it to reconstruct punctuation in a corpus of unpunctuated poems. Our results clearly indicate that speech is helpful for recovering the constituency structure of (post-)modern poetry that is partially obfuscated by missing punctuation.
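
The selection of the target class described in the abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names `mid_line_breaks` and `has_mid_line_break` are hypothetical, and reading "in the middle" as "anywhere except leading or trailing punctuation and whitespace" is an assumption made here for illustration. Only the character set (.,;:!?/) is taken from the abstract.

```python
from typing import List

# Punctuation characters named in the abstract as typical phrase-break markers.
BREAK_CHARS = ".,;:!?/"


def mid_line_breaks(line: str) -> List[int]:
    """Return the character offsets of break punctuation inside the line.

    Assumption: punctuation and whitespace at the very start or end of the
    line do not count as a mid-line break.
    """
    edge = BREAK_CHARS + " \t\r\n"
    start = len(line) - len(line.lstrip(edge))  # first non-edge character
    end = len(line.rstrip(edge))                # one past the last non-edge character
    return [i for i in range(start, end) if line[i] in BREAK_CHARS]


def has_mid_line_break(line: str) -> bool:
    """Binary label: does the line belong to the (assumed) target class?"""
    return bool(mid_line_breaks(line))
```

In this sketch, `has_mid_line_break` corresponds to the first task (does the line contain a break at all?), and the offsets returned by `mid_line_breaks` correspond to the second (where in the line is the break?).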


Year: 2020
In session: Poster
Pages: 165 to 172