A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

Sensors (Basel). 2023 Oct 27;23(21):8770. doi: 10.3390/s23218770.

Abstract

The cocktail party problem can be addressed more effectively by leveraging both the speaker's visual and audio information. This paper proposes a method to improve audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information beyond the face that correlates only minimally with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Third, the loss function incorporates audio-visual similarity to exploit the relationship between the audio and visual modalities more fully. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably with a 4 dB improvement in SDR.
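The abstract names three ingredients without giving architectural detail: residual connections in the audio separation module, attention over face features, and an audio-visual similarity term in the loss. As a rough illustration only, the PyTorch sketch below shows a generic residual 1-D convolution block and a combined loss with a cosine-similarity audio-visual term; every name, the embedding shapes, the alpha weight, and the cosine formulation are assumptions, not the paper's actual design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualConvBlock(nn.Module):
        """One 1-D convolutional block with a residual (skip) connection,
        in the spirit of the residual connections the abstract describes
        for the audio separation module (real layer sizes are unknown)."""
        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            pad = kernel_size // 2
            self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
            self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
            self.norm = nn.BatchNorm1d(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The skip path lets fine-grained (detailed) features bypass
            # the convolutions and reach deeper layers unchanged.
            return F.relu(x + self.norm(self.conv2(F.relu(self.conv1(x)))))

    def audio_visual_loss(sep_loss, audio_emb, visual_emb, alpha=0.1):
        """Combine a separation loss with an audio-visual similarity term.
        The cosine-similarity form and the alpha weight are illustrative
        assumptions; the abstract only says the loss 'considers
        audio-visual similarity'."""
        # Reward matched audio/visual embeddings that point the same way.
        sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean()
        return sep_loss + alpha * (1.0 - sim)

    # Toy usage: (batch, channels, frames) audio features, paired embeddings.
    feats = ResidualConvBlock(channels=64)(torch.randn(2, 64, 100))
    loss = audio_visual_loss(torch.tensor(1.0),
                             torch.randn(2, 128), torch.randn(2, 128))

In this sketch the similarity term is simply added to the separation loss; penalizing (1 - cosine similarity) pushes embeddings of the same speaker's audio and face toward alignment, which is one plausible way to exploit the audio-visual relationship the abstract mentions.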

Keywords: U-Net; attention mechanism; audio-visual; speech separation.

MeSH terms

  • Cues
  • Lip
  • Movement
  • Speech Perception*
  • Speech*