Gender spectrum data from podcasts – a proof of concept

Jan Marquenie; Mareile Leonhardt; Sven Grawunder; Ingo Siegert

Abstract:

Bias in speech recognition systems persists, particularly with respect to gender identity and sexual orientation. Although recent efforts have diversified datasets in terms of age, language, ethnicity, and recording conditions, LGBTQIA+ speakers remain underrepresented. To help fill this gap, we investigated the feasibility of using publicly accessible podcasts featuring LGBTQIA+ persons to compile a corpus of 126 speakers. We propose a semi-automatic collection process: each episode is first diarized automatically, the hosts are then identified, and metadata and guest information from the episode information are linked to the speakers. Finally, the labels are manually revised or supplemented. Our findings highlight podcast data as a promising avenue for capturing the diversity needed to mitigate bias in speech technology and foster more equitable voice systems.
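
The semi-automatic process outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Segment` type, the `propose_labels` function, the `min_share` threshold, and the heuristic that the speaker with the most talk time is the host are all assumptions made for illustration. The diarization output is taken as given, and every proposed label is flagged for the manual revision step.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized speech segment (hypothetical structure)."""
    speaker: str   # anonymous diarization label, e.g. "SPEAKER_00"
    start: float   # seconds
    end: float     # seconds


def speaking_time(segments):
    """Sum the speaking time in seconds per diarization label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def propose_labels(segments, guest_names, min_share=0.1):
    """Propose a role and name per speaker for later manual review.

    Assumed heuristic (not from the paper): the speaker with the most
    talk time is the host; other speakers above `min_share` of the total
    talk time are matched, in order, to guest names taken from the
    episode information; the rest stay unlabeled.
    """
    totals = speaking_time(segments)
    overall = sum(totals.values())
    ranked = sorted(totals, key=totals.get, reverse=True)
    guests = iter(guest_names)
    proposals = {}
    for rank, speaker in enumerate(ranked):
        share = totals[speaker] / overall if overall else 0.0
        if rank == 0:
            role, name = "host", None
        elif share >= min_share:
            role, name = "guest", next(guests, None)
        else:
            role, name = "unknown", None
        proposals[speaker] = {
            "role": role,
            "name": name,
            "share": round(share, 3),
            "needs_review": True,  # every label goes to manual revision
        }
    return proposals


episode = [
    Segment("SPEAKER_00", 0.0, 60.0),
    Segment("SPEAKER_01", 60.0, 100.0),
    Segment("SPEAKER_00", 100.0, 160.0),
    Segment("SPEAKER_02", 160.0, 165.0),
]
labels = propose_labels(episode, guest_names=["Guest A"])
```

Keeping `needs_review` set on every entry reflects the manual revision step: the automatic pass only proposes labels, and a human confirms, corrects, or adds them.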

Year: 2025
In session: Poster
Pages: 239 to 246