Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation

Zhong-Qiu Wang; Peidong Wang; DeLiang Wang

doi:10.1109/taslp.2021.3083405

IEEE/ACM Trans Audio Speech Lang Process. 2021;29:2001-2014. doi: 10.1109/taslp.2021.3083405. Epub 2021 May 26.

Authors

Zhong-Qiu Wang¹, Peidong Wang², DeLiang Wang³

Affiliations

¹ Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA, while performing this work. He is now with Mitsubishi Electric Research Laboratories, Cambridge, MA 02139, USA.
² Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA.
³ Department of Computer Science and Engineering & the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277 USA.

Abstract

We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for speaker separation in reverberant conditions. We aim at both speaker separation and dereverberation. Our study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation (CSS). Assuming a fixed array geometry between training and testing, we train deep neural networks (DNNs) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online CSS. Although our system is trained on simulated room impulse responses (RIRs) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
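
As a rough illustration of two steps named in the abstract, the NumPy sketch below (i) stacks the RI components of all microphones into a real-valued input tensor and (ii) computes an MVDR beamformer from target and interference estimates. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the (mics, frames, freqs) array layout, the time-invariant covariance averaging, the diagonal loading constant, and the reference-microphone form of the MVDR solution are all choices made here for illustration, and the DNN itself is omitted.

```python
import numpy as np

def stack_ri(stft):
    """Stack real and imaginary parts of a multi-microphone STFT.

    stft: complex array of shape (mics, frames, freqs)  [assumed layout]
    returns: real array of shape (2 * mics, frames, freqs)
    """
    return np.concatenate([stft.real, stft.imag], axis=0)

def spatial_covariance(x):
    """Per-frequency spatial covariance, averaged over all frames.

    x: complex array (mics, frames, freqs)
    returns: complex array (freqs, mics, mics)
    """
    return np.einsum("ptf,qtf->fpq", x, x.conj()) / x.shape[1]

def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-6):
    """MVDR weights in the reference-microphone form
    w = (Phi_n^{-1} Phi_s) u_ref / trace(Phi_n^{-1} Phi_s).

    phi_s, phi_n: (freqs, mics, mics) target / interference covariances
    returns: complex array (freqs, mics)
    """
    mics = phi_s.shape[-1]
    phi_n = phi_n + eps * np.eye(mics)      # diagonal loading (assumed value)
    num = np.linalg.solve(phi_n, phi_s)     # Phi_n^{-1} Phi_s, per frequency
    trace = np.trace(num, axis1=-2, axis2=-1)[:, None]
    return num[:, :, ref] / (trace + eps)

def apply_beamformer(w, y):
    """Compute w[f]^H y[:, t, f] for every time-frequency bin.

    w: (freqs, mics), y: (mics, frames, freqs)
    returns: complex array (frames, freqs)
    """
    return np.einsum("fp,ptf->tf", w.conj(), y)
```

In use, `stack_ri(mixture_stft)` would form the network input; if estimates of the target at every microphone were available (here a hypothetical array `s_hat`), the covariances could be formed as `spatial_covariance(s_hat)` and `spatial_covariance(mixture_stft - s_hat)` before calling `mvdr_weights` and `apply_beamformer`.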

Keywords: Complex spectral mapping; deep learning; microphone array processing; speaker separation.

Grants and funding

R01 DC012048/DC/NIDCD NIH HHS/United States