A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training

Sensors (Basel). 2023 Jan 12;23(2):870. doi: 10.3390/s23020870.

Abstract

Building a good speech recognition system usually requires a large amount of paired speech and text data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it is rarely applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a Factorized TDNN layer to better preserve the relationship between the speech signal and its surrounding time steps before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio data and is applied to the cross-lingual ASR task, optimized with a noise-contrastive binary classification task. At the same time, speech synthesis is used to improve the performance of speech recognition. The experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages and that multilingual pre-training clearly outperforms monolingual pre-training. Data augmentation with speech synthesis brings substantial gains. Compared with the baseline model, the word error rate on the LibriSpeech test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. Multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to previous end-to-end models trained on 30 times more labeled data.
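To make the architectural idea concrete, the following is a minimal PyTorch sketch of a factorized TDNN (TDNN-F) layer of the kind the abstract describes inserting into the wav2vec 2.0 encoder before quantization. The layer width, bottleneck size, kernel sizes, and residual placement are illustrative assumptions; the abstract does not specify the exact wav2vec-F configuration, and the semi-orthogonal constraint of the original TDNN-F (Povey et al., 2018) is omitted for brevity.

```python
import torch
import torch.nn as nn


class FactorizedTDNNLayer(nn.Module):
    """Factorized TDNN (TDNN-F) layer: one wide 1-D convolution factorized
    into two smaller convolutions through a low-rank bottleneck, keeping
    temporal context with far fewer parameters. Dimensions here are
    illustrative assumptions, not the paper's configuration."""

    def __init__(self, dim: int = 512, bottleneck: int = 128, kernel_size: int = 3):
        super().__init__()
        # First factor: project into a low-rank bottleneck with temporal context.
        # (The original TDNN-F also keeps this factor semi-orthogonal; omitted here.)
        self.conv_a = nn.Conv1d(dim, bottleneck, kernel_size, padding=kernel_size // 2)
        # Second factor: expand back to the model dimension.
        self.conv_b = nn.Conv1d(bottleneck, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.BatchNorm1d(dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); Conv1d expects (batch, dim, time).
        y = x.transpose(1, 2)
        y = self.act(self.norm(self.conv_b(self.conv_a(y))))
        # Residual connection keeps each frame aligned with its time step,
        # the property the abstract aims to preserve around quantization.
        return x + y.transpose(1, 2)


if __name__ == "__main__":
    layer = FactorizedTDNNLayer()
    frames = torch.randn(4, 100, 512)   # (batch, frames, features)
    print(layer(frames).shape)          # torch.Size([4, 100, 512])
```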
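The pre-training objective the abstract calls a noise-contrastive binary classification task can be sketched in the same spirit as wav2vec 2.0: for each time step, the model must tell the true quantized latent apart from distractors sampled elsewhere in the utterance. The cosine-similarity logits, the number of distractors, and uniform negative sampling below are assumptions for illustration, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F


def noise_contrastive_loss(context: torch.Tensor,
                           quantized: torch.Tensor,
                           num_negatives: int = 10) -> torch.Tensor:
    """Binary classification of (context, candidate) pairs: the candidate is
    either the true quantized latent for that time step or a distractor
    drawn from another time step of the same utterance. Collisions between
    a sampled distractor and the true step are ignored for brevity."""
    batch, time, dim = context.shape
    # Positive pairs: context vector vs. quantized latent at the same step.
    pos_logits = F.cosine_similarity(context, quantized, dim=-1)       # (B, T)
    # Negative pairs: quantized latents gathered from random other steps.
    neg_idx = torch.randint(0, time, (batch, time, num_negatives))
    negatives = torch.gather(
        quantized.unsqueeze(2).expand(-1, -1, num_negatives, -1),
        1,
        neg_idx.unsqueeze(-1).expand(-1, -1, -1, dim),
    )                                                                  # (B, T, K, D)
    neg_logits = F.cosine_similarity(context.unsqueeze(2), negatives, dim=-1)
    logits = torch.cat([pos_logits.unsqueeze(2), neg_logits], dim=2)   # (B, T, 1+K)
    labels = torch.zeros_like(logits)
    labels[..., 0] = 1.0  # only the first candidate is the true latent
    return F.binary_cross_entropy_with_logits(logits, labels)
```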

Keywords: Factorized TDNN; automatic speech recognition; speech synthesis; unsupervised pre-training.

MeSH terms

  • Language
  • Noise
  • Speech Perception*
  • Speech Recognition Software
  • Speech*