Attention-Based Joint Training of Noise Suppression and Sound Event Detection for Noise-Robust Classification

Jin-Young Son; Joon-Hyuk Chang

doi:10.3390/s21206718

Sensors (Basel). 2021 Oct 9;21(20):6718. doi: 10.3390/s21206718.

Authors

Jin-Young Son¹, Joon-Hyuk Chang¹

Affiliation

¹ Department of Electronic Engineering, Hanyang University, Seoul 04763, Korea.

Abstract

Sound event detection (SED) identifies the sound event present in an incoming signal and estimates its temporal boundaries. Although SED has recently been developed and applied in various fields, noise-robust SED in real environments remains challenging because ambient noise degrades performance. In this paper, we propose combining a pretrained time-domain, speech-separation-based noise suppression (NS) network with a pretrained classification network to improve SED performance in real noisy environments. We use a temporal convolutional network (TCN) equipped with group communication with a context codec (GC3) for the NS model and a convolutional recurrent neural network for the SED model. The GC3-equipped TCN significantly reduces model complexity while retaining the same TCN module and performance as the fully convolutional time-domain audio separation network (Conv-TasNet). During joint fine-tuning, we freeze some layers (i.e., do not update their weights) and add an attention module to the SED model to further improve performance and prevent overfitting. We evaluate the proposed method on both simulated and real recorded datasets. The experimental results show that our method improves classification performance in noisy environments under various signal-to-noise ratio (SNR) conditions.
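
As an illustration of what evaluating "under various signal-to-noise-ratio conditions" can involve, the following is a minimal sketch of mixing a clean signal with noise at a target SNR, using the standard definition SNR(dB) = 10 log10(P_clean / P_noise). The abstract does not describe how the simulated data were constructed, so everything here is an assumption for illustration only: the function names (`power`, `mix_at_snr`, `measured_snr_db`), the 440 Hz test tone, the 16 kHz sample rate, and the list of target SNRs are all hypothetical and are not taken from the paper.

```python
import math
import random


def power(x):
    """Mean power (average squared amplitude) of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals `snr_db`,
    then add it to `clean`. Returns (noisy_signal, scaled_noise).
    Hypothetical helper, not the authors' data pipeline."""
    p_clean = power(clean)
    p_noise = power(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("clean and noise must both have non-zero power")
    # Solve 10*log10(p_clean / (gain^2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled


def measured_snr_db(clean, noise):
    """SNR in dB computed from the separate clean and noise components."""
    return 10 * math.log10(power(clean) / power(noise))


# Assumed toy inputs: one second of a 440 Hz tone at 16 kHz plus Gaussian noise.
rng = random.Random(0)
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

# Assumed target SNRs in dB; the abstract does not list the conditions used.
results = {}
for target in (-5, 0, 5, 10):
    noisy, scaled = mix_at_snr(clean, noise, target)
    results[target] = measured_snr_db(clean, scaled)
```

Each entry of `results` should match its target SNR up to floating-point error, which is a quick sanity check before feeding such mixtures to any noise suppression or SED model.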

Keywords: attention; deep neural network; joint training; noise suppression; noise-robust classification; sound event detection.

MeSH terms

Neural Networks, Computer*
Noise*
Signal-To-Noise Ratio
Sound
Speech