CompNet: Complementary network for single-channel speech enhancement

Cunhang Fan; Hongmei Zhang; Andong Li; Wang Xiang; Chengshi Zheng; Zhao Lv; Xiaopei Wu

doi:10.1016/j.neunet.2023.09.041

Neural Netw. 2023 Nov:168:508-517. doi: 10.1016/j.neunet.2023.09.041. Epub 2023 Sep 25.

Authors

Cunhang Fan¹, Hongmei Zhang¹, Andong Li², Wang Xiang¹, Chengshi Zheng², Zhao Lv³, Xiaopei Wu⁴

Affiliations

¹ Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China.
² Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China.
³ Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China. Electronic address: kjlz@ahu.edu.cn.
⁴ Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China. Electronic address: wxp2001@ahu.edu.cn.

PMID: 37832318
DOI: 10.1016/j.neunet.2023.09.041

Abstract

Recent multi-domain processing methods have demonstrated promising performance on monaural speech enhancement tasks. However, few of them explain why they outperform single-domain approaches. As an attempt to fill this gap, this paper presents a complementary single-channel speech enhancement network (CompNet) that demonstrates promising denoising capabilities and provides a unique perspective for understanding the improvements introduced by multi-domain processing. Specifically, the noisy speech is initially enhanced through a time-domain network. However, although the waveform can be feasibly recovered, the distribution of the time-frequency bins may still differ in part from the target spectrum when the problem is reconsidered in the frequency domain. To solve this problem, we design a dedicated dual-path network as a post-processing module to independently filter the magnitude and refine the phase. This further drives the estimated spectrum to closely approximate the target spectrum in the time-frequency domain. We conduct extensive experiments with the WSJ0-SI84 and VoiceBank + Demand datasets. Objective test results show that the performance of the proposed system is highly competitive with existing systems.
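
The idea of filtering the magnitude and refining the phase separately can be illustrated with a small NumPy sketch. This is not the CompNet architecture, which the abstract does not specify; it only shows one common way to decouple the two operations on a complex spectrum. The function name `filter_and_refine`, the real-valued `gain`, and the additive complex `residual` are all assumptions made for illustration, and the "oracle" values stand in for what two trained branches would have to predict.

```python
import numpy as np

def filter_and_refine(spec, gain, residual):
    """Scale the magnitude with a real gain, then add a complex residual.

    spec:     complex array (freq, frames), spectrum of a first-stage estimate
    gain:     real array, same shape (hypothetical magnitude-filtering branch)
    residual: complex array, same shape (hypothetical phase-refining branch)
    """
    magnitude = np.abs(spec)
    phase = np.angle(spec)
    # Magnitude path: rescale each bin; the phase is left untouched.
    filtered = gain * magnitude * np.exp(1j * phase)
    # Refinement path: a complex residual moves each bin in the complex plane,
    # which is what allows the phase to change.
    return filtered + residual

rng = np.random.default_rng(0)
shape = (4, 3)  # toy (frequency bins, frames)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
# Stand-in for the spectrum of a time-domain network's output (assumed noisy).
coarse = target + 0.3 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

# Oracle gain: matches the target magnitude but keeps the coarse phase.
gain = np.abs(target) / np.maximum(np.abs(coarse), 1e-8)
filtered_only = filter_and_refine(coarse, gain, np.zeros(shape, dtype=complex))

# Oracle residual: whatever is still missing after magnitude filtering.
residual = target - filtered_only
refined = filter_and_refine(coarse, gain, residual)

print("magnitude matched:", np.allclose(np.abs(filtered_only), np.abs(target)))
print("spectrum matched after refinement:", np.allclose(refined, target))
```

In this toy setting, magnitude filtering alone reproduces the target magnitudes but leaves a phase error, and the residual term removes the remainder, which mirrors the motivation for treating the two corrections as separate paths.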

Keywords: Complementary; Filtering and refining; Speech enhancement; Time-domain; Time–frequency domain.

MeSH terms

Algorithms*
Noise
Signal-To-Noise Ratio
Speech*