Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications

Sensors (Basel). 2022 Oct 12;22(20):7738. doi: 10.3390/s22207738.

Abstract

Speech is a commonly used interaction modality in edutainment-based systems and a key technology for smooth educational learning and user-system interaction. However, its application in real environments is limited by ambient noise. In this study, a multimodal interaction system based on audio and visual information is proposed that makes speech-driven virtual aquarium systems robust to ambient noise. For audio-based speech recognition, the list of candidate words returned by a speech API is expressed as word vectors using a pretrained model, while vision-based speech recognition uses a composite end-to-end deep neural network. The vectors derived from the API and from the visual network are then concatenated and classified. The signal-to-noise ratio of the proposed system was evaluated on data from four types of noise environments, and its accuracy and efficiency were compared against existing single-modality approaches for visual feature extraction and audio speech recognition. The average recognition rate was 91.42% when only speech was used and improved by 6.7 percentage points, to 98.12%, when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafés, museums, music halls, and kiosks.
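The fusion step described above (concatenating an audio-derived word vector with a visual feature vector, then classifying the result) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the vector dimensions, the function names, and the use of a plain softmax classifier over the concatenated features are all assumptions.

```python
import math

def fuse_and_classify(audio_vec, visual_vec, weights, biases):
    """Concatenate an audio word vector with a visual feature vector and
    apply a softmax classifier to obtain word-class probabilities.

    Illustrative only: the paper describes concatenating API-derived word
    vectors with features from an end-to-end visual network; the linear
    classifier here is a stand-in assumption for that final classifier.
    """
    fused = list(audio_vec) + list(visual_vec)  # simple feature concatenation
    # One logit per word class: dot product of each weight row with the
    # fused feature vector, plus a bias term.
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    # Numerically stable softmax over the logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: a 4-dim audio embedding, a 3-dim visual embedding, 5 word classes.
audio_vec = [0.2, -0.5, 0.1, 0.9]
visual_vec = [0.3, 0.0, -0.7]
weights = [[0.1 * (i - j) for j in range(7)] for i in range(5)]
biases = [0.0] * 5
probs = fuse_and_classify(audio_vec, visual_vec, weights, biases)
```

The returned `probs` is a probability distribution over the candidate words, so the recognized word is simply the class with the highest probability.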

Keywords: audiovisual speech recognition; deep learning; edutainment; lipreading; multimodal interaction; virtual aquarium.

MeSH terms

  • Noise
  • Signal-To-Noise Ratio
  • Speech Perception*
  • Speech Recognition Software
  • Speech*