Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition

Geon Woo Lee; Hong Kook Kim

doi:10.3390/s22145381

Sensors (Basel). 2022 Jul 19;22(14):5381. doi: 10.3390/s22145381.

Authors

Geon Woo Lee¹, Hong Kook Kim¹,²

Affiliations

¹ AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Korea.
² School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Korea.

Abstract

This paper proposes a new two-step joint optimization approach, based on the asynchronous subregion optimization method, for training a pipeline model composed of two different models. The first step of the proposed approach trains the front-end model only, and the second step trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first step supports only the goal of the front-end model. In the proposed approach, by contrast, the first step uses a new loss function to make the front-end model support the goal of the back-end model. The proposed approach was applied to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model as the front-end and a conformer-transducer-based automatic speech recognition (ASR) model as the back-end. Its performance was evaluated on the LibriSpeech ASR corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed approach on each of the processing blocks in the conformer-transducer ASR model. The ablation study showed that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to separate optimization of speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed approach provided average CER and WER reductions of 0.22% and 0.31%, respectively. Under mismatched noise conditions, the proposed approach also achieved a lower average CER and WER than the conventional optimization approach, by 0.32% and 0.43%, respectively.
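
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how CER and WER are commonly computed: the edit (Levenshtein) distance between the reference and the recognized text, divided by the reference length, at character and word level respectively. The function names are hypothetical, and counting spaces as characters in the CER is an assumption; the abstract does not state which scoring tool or text normalization the authors used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

In this toy pair the hypothesis has one substituted word and one deleted word against a six-word reference, so the WER is 2/6. Corpus-level averages such as those in the abstract are obtained by aggregating over all test utterances.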

Keywords: auxiliary loss function; joint optimization; noise-robust speech recognition; speech enhancement.

MeSH terms

Noise
Speech Perception*
Speech Recognition Software
Speech*

Grants and funding

UD190031RD/This work was conducted by Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD