μ-law SGAN for generating spectra with more details in speech enhancement

Hongfeng Li; Yanyan Xu; Dengfeng Ke; Kaile Su

doi:10.1016/j.neunet.2020.12.017

Neural Netw. 2021 Apr:136:17-27. doi: 10.1016/j.neunet.2020.12.017. Epub 2020 Dec 25.

Authors

Hongfeng Li¹, Yanyan Xu², Dengfeng Ke³, Kaile Su⁴

Affiliations

¹ School of Information Science and Technology, Beijing Forestry University, 35 Qing-Hua East Road, Beijing 100083, China; Engineering Research Center for Forestry-oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China. Electronic address: lihongfeng@bjfu.edu.cn.
² School of Information Science and Technology, Beijing Forestry University, 35 Qing-Hua East Road, Beijing 100083, China; Engineering Research Center for Forestry-oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China. Electronic address: xuyanyan@bjfu.edu.cn.
³ School of Information Science, Beijing Language and Culture University, Beijing 100083, China. Electronic address: dengfeng.ke@blcu.edu.cn.
⁴ Institute for Integrated and Intelligent Systems, Griffith University, Nathan, QLD 4111, Australia. Electronic address: k.su@griffith.edu.au.

PMID: 33422929
DOI: 10.1016/j.neunet.2020.12.017

Abstract

The goal of monaural speech enhancement is to separate clean speech from noisy speech. Recently, many studies have employed generative adversarial networks (GANs) for monaural speech enhancement. In these approaches, the output of the generator is a speech waveform or a spectrum, such as a magnitude spectrum, a mel-spectrum or a complex-valued spectrum. The spectra generated by current speech enhancement methods in the time-frequency domain usually lack details, such as consonants and harmonics with low energy. In this paper, we propose a new type of adversarial training framework for spectrum generation, named μ-law spectrum generative adversarial networks (μ-law SGAN). We introduce a trainable μ-law spectrum compression layer (USCL) into the proposed discriminator to compress the dynamic range of the spectrum. As a result, the compressed spectrum can display more detailed information. In addition, we use the spectrum transformed by USCL to regularize the generator's training, so that the generator can pay more attention to the details of the spectrum. Experimental results on the open dataset Voice Bank + DEMAND show that μ-law SGAN is an effective generative adversarial architecture for speech enhancement. Moreover, visual spectrogram analysis suggests that μ-law SGAN pays more attention to the enhancement of low-energy harmonics and consonants.
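
The effect of μ-law compression on a spectrum's dynamic range can be illustrated with a minimal sketch. It assumes the standard μ-law companding formula, ln(1 + μx) / ln(1 + μ), applied to peak-normalized magnitudes. This is not the paper's USCL: the abstract does not give the layer's exact form, and USCL treats the compression as trainable, whereas here μ is a fixed argument. The function name `mu_law_compress`, the default μ = 255 (a value common in telephony companding), and the sample frame are all illustrative assumptions.

```python
import math


def mu_law_compress(magnitudes, mu=255.0):
    """Apply standard mu-law companding to non-negative magnitudes.

    The input is normalized by its peak so that values lie in [0, 1],
    then mapped through ln(1 + mu * x) / ln(1 + mu).
    NOTE: mu is fixed here; the paper's USCL is described as trainable.
    """
    peak = max(magnitudes)
    if peak == 0.0:
        # An all-zero frame has nothing to compress.
        return [0.0 for _ in magnitudes]
    scale = math.log1p(mu)
    return [math.log1p(mu * (m / peak)) / scale for m in magnitudes]


# Hypothetical magnitude-spectrum frame: one strong harmonic
# followed by progressively weaker components.
frame = [1.0, 0.1, 0.01, 0.001]
compressed = mu_law_compress(frame)

for original, value in zip(frame, compressed):
    print(f"{original:>6} -> {value:.4f}")
```

In this sketch the peak stays at 1 while the weaker components are raised relative to it, which is the sense in which a compressed spectrum exposes low-energy detail. The ordering of the components is preserved because the mapping is monotonic.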

Keywords: μ-law SGAN; Deep neural networks; Generative adversarial networks; Signal processing; Speech enhancement.

MeSH terms

Data Compression / methods
Deep Learning*
Humans
Neural Networks, Computer*
Speech / physiology
Speech Perception / physiology*
Speech Recognition Software*