Learning the Relative Dynamic Features for Word-Level Lipreading

Hao Li; Nurbiya Yadikar; Yali Zhu; Mutallip Mamut; Kurban Ubul

doi:10.3390/s22103732

Sensors (Basel). 2022 May 13;22(10):3732. doi: 10.3390/s22103732.

Authors

Hao Li¹, Nurbiya Yadikar¹ ², Yali Zhu¹, Mutallip Mamut³, Kurban Ubul¹ ²

Affiliations

¹ School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China.
² Xinjiang Key Laboratory of Multilingual Information Processing, Urumqi 830046, China.
³ Technology Department, Library of Xinjiang University, Urumqi 830046, China.

Abstract

Lipreading is a technique for analyzing sequences of lip movements in order to recognize the speech content of a speaker. Because the structure of the vocal organs limits the number of distinct pronunciations a speaker can produce, homophones are a problem in speech. In addition, different speakers produce different lip movements for the same word. To address these problems, this paper focuses on spatial-temporal feature extraction in word-level lipreading and proposes an efficient two-stream model that learns the relative dynamic information of lip motion. In this model, two CNN streams with different channel capacities extract static features from single frames and dynamic information across multi-frame sequences, respectively. We explored a more effective convolution structure for each component of the front-end model, which yielded an improvement of about 8%. Then, according to the characteristics of the word-level lipreading dataset, we further studied the impact of two sampling methods on the fast and slow channels. Furthermore, we discussed the influence of the methods used to fuse the front-end and back-end models under the two-stream network structure. Finally, we evaluated the proposed model on two large-scale lipreading datasets and achieved new state-of-the-art results.

Keywords: Visual Speech Recognition; lipreading; spatial–temporal feature extraction.
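
The abstract mentions sampling frames for a fast and a slow channel but does not give the procedure. The following is only a minimal sketch of one common way to feed two streams from the same clip, assuming the fast channel sees every frame and the slow channel sees every few frames. The function name `sample_two_streams`, the stride of 4, and the 29-frame clip length are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Sequence, Tuple


def sample_two_streams(frames: Sequence, slow_stride: int = 4) -> Tuple[List, List]:
    """Split one clip into a dense 'fast' view and a sparse 'slow' view.

    Hypothetical helper: the stride-based scheme is an assumption made
    for illustration, not the sampling method evaluated in the paper.
    """
    if slow_stride < 1:
        raise ValueError("slow_stride must be >= 1")
    # Fast channel: keep every frame, so frame-to-frame motion is preserved.
    fast = list(frames)
    # Slow channel: keep every slow_stride-th frame, so each retained frame
    # can be processed with more channel capacity at similar cost.
    slow = list(frames[::slow_stride])
    return fast, slow


# Frames are stood in for by their indices; a real clip would hold images.
clip = list(range(29))  # assumed clip length
fast, slow = sample_two_streams(clip, slow_stride=4)
print(len(fast), len(slow))  # 29 8
```

Under these assumptions the slow channel receives 8 of the 29 frames, which illustrates why the choice of sampling method matters for short word-level clips: a larger stride leaves very few frames for the sparse stream.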

MeSH terms

Algorithms*
Humans
Learning
Lipreading*
Motion
Movement
