A Deep Learning Based Approach to Synthesize Intelligible Speech with Limited Temporal Envelope Information

Ching-Ju Hsiao; Fei Chen; Ji-Yan Han; Wei-Zhong Zheng; Ying-Hui Lai

doi:10.1109/EMBC48229.2022.9871247

Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul:2022:1972-1976. doi: 10.1109/EMBC48229.2022.9871247.

PMID: 36086160

Abstract

Envelope waveforms can be extracted from multiple frequency bands of a speech signal, and they carry important intelligibility information for human speech communication. This study investigated whether a deep learning-based model using temporal envelope features could synthesize intelligible speech, and how reducing the number of temporal envelopes (from 8 to 2 in this work) affects the intelligibility of the synthesized speech. The short-time objective intelligibility (STOI) metric showed that, on average, the speech synthesized by the proposed approach achieved high STOI scores (i.e., 0.8) in each test condition, and a human listening test showed that the average word correct rate of eight listeners was higher than 97.5%. These findings indicate that the proposed deep learning-based system is a potential approach for synthesizing highly intelligible speech from limited envelope information in the future.
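
To make the idea of band-wise temporal envelopes concrete, the following is a minimal sketch of one common way to extract them: split the signal into frequency bands and take the magnitude of each band's analytic signal (the Hilbert envelope). It is not the authors' pipeline. The abstract does not describe the filterbank, so the log-spaced band edges, the 80 to 6000 Hz range, the FFT-based band splitting, and all function names here are assumptions chosen for illustration.

```python
import numpy as np


def band_edges(n_bands, f_lo=80.0, f_hi=6000.0):
    """Log-spaced band edges in Hz (an assumed layout, not taken from the paper)."""
    return np.geomspace(f_lo, f_hi, n_bands + 1)


def analytic_envelope(x):
    """Hilbert envelope: magnitude of the analytic signal, computed via the FFT."""
    n = len(x)
    spec = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero out negative frequencies.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spec * h))


def extract_envelopes(x, fs, n_bands):
    """Return an array of shape (n_bands, len(x)) with one envelope per band."""
    edges = band_edges(n_bands)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Ideal (brick-wall) band-pass filter applied in the frequency domain.
        mask = (freqs >= lo) & (freqs < hi)
        band = np.fft.irfft(spec * mask, n=len(x))
        envelopes.append(analytic_envelope(band))
    return np.stack(envelopes)


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    # Synthetic test signal: a 1 kHz tone with a slow 4 Hz amplitude modulation.
    x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
    for n_bands in (8, 2):
        env = extract_envelopes(x, fs, n_bands)
        print(n_bands, env.shape)
```

Calling `extract_envelopes` with `n_bands=8` and then `n_bands=2` mirrors the reduction in envelope count studied here: fewer bands give a coarser spectral description, which the synthesis model must then compensate for.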

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Auditory Perception
Deep Learning*
Humans
Speech Intelligibility
Speech Perception*
Time Factors