Modular text normalization pipeline for language model training

Abstract:

Language modeling is integral to many natural language processing tasks, and speech recognition applications in particular require clean training data to produce coherent results. Most existing text normalization and data cleaning algorithms are rigid end-to-end solutions that allow little customization. The text normalization pipeline presented here is modular and configurable and can be applied to various text sources. Integrating additional steps that remove garbage text proved advantageous for text generation with language models trained solely on data produced by this pipeline. Furthermore, combining rule-based and machine learning processes proved effective, producing data faster than previous solutions.
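
To illustrate what "modular and configurable" could mean in practice, the following is a minimal sketch and not the authors' implementation. It assumes each step is a function that maps a line to a normalized line, or to None when the line should be discarded as garbage. The step names, the registry `STEPS`, the helper `build_pipeline`, and the 50% letter-ratio threshold are all hypothetical. Only rule-based steps are shown; a machine learning step, such as a learned garbage classifier, could be registered behind the same interface.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# A step returns the normalized line, or None to drop the line entirely.
Step = Callable[[str], Optional[str]]


def lowercase(line: str) -> Optional[str]:
    return line.lower()


def collapse_whitespace(line: str) -> Optional[str]:
    # Replace every run of whitespace with one space and trim the ends.
    return re.sub(r"\s+", " ", line).strip()


def drop_garbage(line: str) -> Optional[str]:
    # Hypothetical heuristic: keep a line only if at least half of its
    # characters are letters or spaces. Empty lines are dropped.
    if not line:
        return None
    ok = sum(ch.isalpha() or ch.isspace() for ch in line)
    return line if ok / len(line) >= 0.5 else None


# Hypothetical registry so a configuration can refer to steps by name.
STEPS: Dict[str, Step] = {
    "lowercase": lowercase,
    "collapse_whitespace": collapse_whitespace,
    "drop_garbage": drop_garbage,
}


def build_pipeline(names: List[str]) -> Callable[[Iterable[str]], List[str]]:
    """Compose the named steps, in the given order, into one function."""
    steps = [STEPS[name] for name in names]

    def run(lines: Iterable[str]) -> List[str]:
        out: List[str] = []
        for line in lines:
            current: Optional[str] = line
            for step in steps:
                current = step(current)
                if current is None:  # a step rejected the line
                    break
            if current is not None:
                out.append(current)
        return out

    return run


pipeline = build_pipeline(["collapse_whitespace", "lowercase", "drop_garbage"])
cleaned = pipeline(["  Hello   World ", "#### 1234 ////", ""])
```

Under these assumptions, adapting the pipeline to a different text source amounts to changing the list of step names passed to `build_pipeline`, without touching the steps themselves.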


Year: 2025
In session: Poster
Pages: 231 to 238