An Approach to Improving Robustness in Dynamic Acoustic Environments: Context Noise Representation Learning for Urban Speech Emotion Recognition

Abstract:

In modern urban environments, speech recognition systems suffer significant performance degradation from background noise. Conventional approaches typically rely on signal enhancement or generative error correction, which can inadvertently remove the high-level emotional cues essential for understanding user intent. In this work, we propose a context noise representation learning (CNRL) framework that improves robustness by aligning noisy speech representations with their clean counterparts in the latent space. By leveraging conversational context and a feature fusion strategy, our model learns to recover clean emotional features from noisy input. Evaluated on the IEMOCAP dataset under a strict Leave-One-Session-Out (LOSO) protocol, our method is more robust than baseline approaches in low-SNR conditions.
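
To make the evaluation protocol and the alignment idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. The abstract does not state the alignment objective, so the mean squared error used below is an assumption. The session labels, the utterance IDs, and the helper names `loso_folds` and `alignment_loss` are hypothetical and chosen only for illustration. The only grounded fact is that IEMOCAP is organised into five sessions, so LOSO yields five folds.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# IEMOCAP is organised into five recorded sessions; these labels are illustrative.
SESSIONS = ["Ses01", "Ses02", "Ses03", "Ses04", "Ses05"]


def loso_folds(
    utterances: Dict[str, List[str]]
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_session, train_ids, test_ids), one fold per session."""
    sessions = sorted(utterances)
    for held_out in sessions:
        # Every utterance of the held-out session goes to the test split.
        test_ids = list(utterances[held_out])
        # All remaining sessions form the training split.
        train_ids = [u for s in sessions if s != held_out for u in utterances[s]]
        yield held_out, train_ids, test_ids


def alignment_loss(noisy: Sequence[float], clean: Sequence[float]) -> float:
    """Mean squared error between a noisy and a clean latent vector.

    ASSUMPTION: the abstract does not specify the alignment objective;
    MSE is used here only as a simple stand-in.
    """
    if len(noisy) != len(clean):
        raise ValueError("latent vectors must have the same dimension")
    return sum((n - c) ** 2 for n, c in zip(noisy, clean)) / len(noisy)


if __name__ == "__main__":
    # Toy utterance IDs, three per session (hypothetical data).
    toy = {s: [f"{s}_utt{i}" for i in range(3)] for s in SESSIONS}
    for held_out, train_ids, test_ids in loso_folds(toy):
        print(held_out, len(train_ids), len(test_ids))
    print(alignment_loss([0.5, 1.0], [0.0, 1.0]))
```

Because each fold holds out a whole session, no speaker from the test session appears in training, which is what makes the protocol strict.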
Year: 2026
In session: Speech Signal Recognition and Enhancement
Pages: 40 to 46