A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

Sensors (Basel). 2023 Oct 27;23(21):8770. doi: 10.3390/s23218770.

Abstract

The cocktail party problem can be addressed more effectively by leveraging both the speaker's visual and audio information. This paper proposes a method to improve audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information beyond the face that correlates only minimally with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Third, the loss function incorporates audio-visual similarity to exploit the relationship between the audio and visual modalities more fully. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably with a 4 dB improvement in SDR.
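The abstract names three ingredients without giving architectural detail: residual connections in the audio separation module, attention over face features, and an audio-visual similarity term in the loss. As a rough illustration only, the PyTorch sketch below shows a generic residual 1-D convolution block and a combined loss with a cosine-similarity audio-visual term; every name, the embedding shapes, the alpha weight, and the cosine formulation are assumptions, not the paper's actual design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualConvBlock(nn.Module):
        """One 1-D convolutional block with a residual (skip) connection,
        in the spirit of the residual connections the abstract describes
        for the audio separation module (real layer sizes are unknown)."""
        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            pad = kernel_size // 2
            self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
            self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
            self.norm = nn.BatchNorm1d(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The skip path lets fine-grained (detailed) features bypass
            # the convolutions and reach deeper layers unchanged.
            return F.relu(x + self.norm(self.conv2(F.relu(self.conv1(x)))))

    def audio_visual_loss(sep_loss, audio_emb, visual_emb, alpha=0.1):
        """Combine a separation loss with an audio-visual similarity term.
        The cosine-similarity form and the alpha weight are illustrative
        assumptions; the abstract only says the loss 'considers
        audio-visual similarity'."""
        # Reward matched audio/visual embeddings that point the same way.
        sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean()
        return sep_loss + alpha * (1.0 - sim)

    # Toy usage: (batch, channels, frames) audio features, paired embeddings.
    feats = ResidualConvBlock(channels=64)(torch.randn(2, 64, 100))
    loss = audio_visual_loss(torch.tensor(1.0),
                             torch.randn(2, 128), torch.randn(2, 128))

In this sketch the similarity term is simply added to the separation loss; penalizing (1 - cosine similarity) pushes embeddings of the same speaker's audio and face toward alignment, which is one plausible way to exploit the audio-visual relationship the abstract mentions.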

Keywords: U-Net; attention mechanism; audio-visual; speech separation.

MeSH terms

  • Cues
  • Lip
  • Movement
  • Speech Perception*
  • Speech*