Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

Sensors (Basel). 2023 Jun 25;23(13):5890. doi: 10.3390/s23135890.

Abstract

Multimodal deep learning, in the context of biometrics, encounters significant challenges due to its dependence on long speech utterances and RGB images, which are often impractical to obtain in certain situations. This paper presents a novel solution that addresses these issues by leveraging ultrashort voice utterances and depth videos of the lip for person identification. The proposed method uses a combination of residual neural networks to encode the depth videos and a Time Delay Neural Network architecture to encode the voice signals. To fuse information from these different modalities, we integrate self-attention and design a noise-resistant model that handles diverse types of noise. Through rigorous testing on a benchmark dataset, our approach outperforms existing methods, with an average improvement of 10%. The method is particularly suited to scenarios where extended utterances and RGB images are unavailable or impractical to capture. Furthermore, its potential extends to multimodal applications beyond person identification.
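
For illustration, the following is a minimal sketch of the attention-based late-fusion idea described above, assuming a PyTorch implementation in which each unimodal embedding (a TDNN-style voice embedding and a ResNet-style depth-video embedding) is treated as one token for self-attention. The module name `AttentionFusion`, the embedding size, the number of heads, and the classifier head are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: self-attention fusion of two modality embeddings for identification.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, embed_dim=256, num_heads=4, num_classes=100):
        super().__init__()
        # Self-attention over the two modality "tokens" (voice, depth video).
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, voice_emb, depth_emb):
        # voice_emb, depth_emb: (batch, embed_dim) outputs of the unimodal encoders
        tokens = torch.stack([voice_emb, depth_emb], dim=1)   # (batch, 2, embed_dim)
        fused, _ = self.attn(tokens, tokens, tokens)          # attend across modalities
        pooled = fused.mean(dim=1)                            # pool the attended tokens
        return self.classifier(pooled)                        # identity logits

# Example usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    model = AttentionFusion()
    voice = torch.randn(8, 256)   # e.g. TDNN embedding of an ultrashort utterance
    depth = torch.randn(8, 256)   # e.g. ResNet embedding of a depth lip video
    logits = model(voice, depth)
    print(logits.shape)           # torch.Size([8, 100])
```

Stacking the two embeddings as tokens lets the attention weights learn how much each modality should contribute (e.g., down-weighting a noisy voice signal) before the pooled representation is classified.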

Keywords: depth images; late fusion; lip identification; multimodality; spatiotemporal; speaker identification.

MeSH terms

  • Biometry
  • Humans
  • Neural Networks, Computer
  • Noise
  • Videotape Recording
  • Voice*

Grants and funding

The PhD grant of A. M. is funded by Angers Loire Metropole (ALM).