A Deep Learning Based Approach to Synthesize Intelligible Speech with Limited Temporal Envelope Information

Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul:2022:1972-1976. doi: 10.1109/EMBC48229.2022.9871247.

Abstract

Envelope waveforms can be extracted from multiple frequency bands of a speech signal, and they carry information that is important for the intelligibility of human speech. This study investigated whether a deep learning-based model that takes temporal envelope information as input features could synthesize intelligible speech, and examined how reducing the number of temporal envelopes (from 8 to 2 in this work) affects the intelligibility of the synthesized speech. In the objective evaluation, the speech synthesized by the proposed approach achieved high short-time objective intelligibility (STOI) scores (i.e., 0.8 on average) in each test condition; in the human listening test, the average word correct rate of eight listeners was higher than 97.5%. These findings indicate that the proposed deep learning-based system could, in the future, be a potential approach for synthesizing highly intelligible speech from limited envelope information.
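The abstract does not say how the band envelopes were computed, so the following is only a minimal sketch of one common way to obtain temporal envelopes from several frequency bands: band-pass filtering, full-wave rectification, and low-pass smoothing. The function names, the log-spaced band edges between 100 and 4000 Hz, the second-order band-pass filter, and the 50 Hz smoothing cutoff are all illustrative assumptions, not details taken from the paper.

```python
import math

def band_edges(n_bands, f_min=100.0, f_max=4000.0):
    """Log-spaced band edges in Hz (assumed spacing, not from the paper)."""
    ratio = f_max / f_min
    return [f_min * ratio ** (i / n_bands) for i in range(n_bands + 1)]

def bandpass(signal, fs, f_lo, f_hi):
    """Second-order band-pass filter (biquad with unity gain at the centre)."""
    f0 = math.sqrt(f_lo * f_hi)            # geometric centre frequency
    q = f0 / (f_hi - f_lo)                 # quality factor from the bandwidth
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha                 # b1 is zero for this filter type
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def envelope(band_signal, fs, cutoff_hz=50.0):
    """Full-wave rectification followed by a one-pole low-pass smoother."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, state = [], 0.0
    for x in band_signal:
        state = (1.0 - a) * abs(x) + a * state
        env.append(state)
    return env

def extract_envelopes(signal, fs, n_bands):
    """Return one temporal envelope per band, ordered from low to high."""
    edges = band_edges(n_bands)
    return [
        envelope(bandpass(signal, fs, edges[i], edges[i + 1]), fs)
        for i in range(n_bands)
    ]
```

Calling `extract_envelopes(signal, fs, 8)` and `extract_envelopes(signal, fs, 2)` on the same list of samples would give the two extremes of the envelope counts mentioned in the abstract. A production pipeline would more likely use higher-order filters or a Hilbert-transform envelope from a signal-processing library.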

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Auditory Perception
  • Deep Learning*
  • Humans
  • Speech Intelligibility
  • Speech Perception*
  • Time Factors