Domain Adaptation with Augmented Data by Deep Neural Network Based Method Using Re-Recorded Speech for Automatic Speech Recognition in Real Environment

Sensors (Basel). 2022 Dec 16;22(24):9945. doi: 10.3390/s22249945.

Abstract

The most effective automatic speech recognition (ASR) approaches are based on artificial neural networks (ANNs), which must be trained with an adequate amount of data matched to the target condition. Adapting an ASR model with augmented data whose conditions match the real environment therefore yields better results on real data. Real-world speech recordings vary in their acoustic characteristics depending on the recording channel and environment, such as the Long Term Evolution (LTE) channel of mobile telephones, where speech is transmitted with voice over LTE (VoLTE) technology, or wireless pin microphones in a classroom. Acquiring data covering such variation is costly. We therefore propose training ASR models with simulated augmented data and fine-tuning them for domain adaptation using deep neural network (DNN)-based simulated data together with re-recorded data; the DNN-based feature transformation generates realistic speech features from clean-condition recordings. This work presents a comparative investigation of different recording-channel adaptation methods for real-world speech recognition. The proposed method yields a 27.0% character error rate reduction (CERR) for the DNN-hidden Markov model (DNN-HMM) hybrid ASR approach and a 36.4% CERR for the end-to-end ASR approach on the target domain of LTE-channel telephone speech.
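To make the core idea concrete, below is a minimal sketch (not the authors' exact model) of a DNN-based feature transformation: a frame-level network that maps clean-condition features, with some temporal context, to channel-matched (e.g., VoLTE-like) features, trained on time-aligned pairs of clean and re-recorded speech. The feature dimension, context width, layer sizes, and the toy random data are all illustrative assumptions.

```python
# Sketch of DNN-based feature transformation for channel-matched data augmentation.
# Assumptions: 80-dim log-mel features, +/-5 frames of context, a simple MLP, MSE loss.
import torch
import torch.nn as nn

FEAT_DIM = 80          # log-mel bins (assumed)
CONTEXT = 5            # +/- context frames stacked as input (assumed)
IN_DIM = FEAT_DIM * (2 * CONTEXT + 1)

class FeatureTransformDNN(nn.Module):
    """Maps clean features (with context) to re-recorded-style features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, FEAT_DIM),
        )

    def forward(self, x):
        return self.net(x)

def stack_context(feats, context=CONTEXT):
    """Stack +/- context frames around each frame: (T, D) -> (T, D*(2c+1))."""
    T, _ = feats.shape
    padded = torch.cat([feats[:1].repeat(context, 1), feats, feats[-1:].repeat(context, 1)])
    return torch.cat([padded[i:i + T] for i in range(2 * context + 1)], dim=1)

if __name__ == "__main__":
    # Toy stand-ins for time-aligned clean / re-recorded feature matrices.
    clean = torch.randn(300, FEAT_DIM)       # placeholder for clean log-mel features
    rerecorded = torch.randn(300, FEAT_DIM)  # placeholder for aligned re-recorded features

    model = FeatureTransformDNN()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    x = stack_context(clean)
    for step in range(10):                   # a few illustrative training steps
        optim.zero_grad()
        loss = loss_fn(model(x), rerecorded)
        loss.backward()
        optim.step()

    # Once trained, the model can convert large amounts of clean training data into
    # channel-matched augmented features for ASR fine-tuning.
    augmented = model(stack_context(clean)).detach()
```

In this reading, the transformation network is trained once on a small amount of aligned clean/re-recorded material and then applied to abundant clean data, which is what makes the approach cheaper than collecting matched-condition recordings directly; the actual architecture and feature pipeline used in the paper may differ.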

Keywords: ASR; DNN; VoLTE; classroom recording; data augmentation; feature transformation; real environment; recording alignment.

MeSH terms

  • Language
  • Neural Networks, Computer
  • Speech Perception*
  • Speech Recognition Software
  • Speech*

Grants and funding

This work was partially supported by JSPS KAKENHI Grant Number JP18K11431.