Coupled Multimodal Emotional Feature Analysis Based on Broad-Deep Fusion Networks in Human-Robot Interaction

Luefeng Chen; Min Li; Min Wu; Witold Pedrycz; Kaoru Hirota

doi:10.1109/TNNLS.2023.3236320

Coupled Multimodal Emotional Feature Analysis Based on Broad-Deep Fusion Networks in Human-Robot Interaction

IEEE Trans Neural Netw Learn Syst. 2023 Jan 20:PP. doi: 10.1109/TNNLS.2023.3236320. Online ahead of print.

Authors

Luefeng Chen, Min Li, Min Wu, Witold Pedrycz, Kaoru Hirota

PMID: 37021991
DOI: 10.1109/TNNLS.2023.3236320

Abstract

A coupled multimodal emotional feature analysis (CMEFA) method based on broad-deep fusion networks, which divide multimodal emotion recognition into two layers, is proposed. First, facial emotional features and gesture emotional features are extracted using the broad and deep learning fusion network (BDFN). Considering that the bi-modal emotion is not completely independent of each other, canonical correlation analysis (CCA) is used to analyze and extract the correlation between the emotion features, and a coupling network is established for emotion recognition of the extracted bi-modal features. Both simulation and application experiments are completed. According to the simulation experiments completed on the bimodal face and body gesture database (FABO), the recognition rate of the proposed method has increased by 1.15% compared to that of the support vector machine recursive feature elimination (SVMRFE) (without considering the unbalanced contribution of features). Moreover, by using the proposed method, the multimodal recognition rate is 21.22%, 2.65%, 1.61%, 1.54%, and 0.20% higher than those of the fuzzy deep neural network with sparse autoencoder (FDNNSA), ResNet-101 + GFK, C3D + MCB + DBN, the hierarchical classification fusion strategy (HCFS), and cross-channel convolutional neural network (CCCNN), respectively. In addition, preliminary application experiments are carried out on our developed emotional social robot system, where emotional robot recognizes the emotions of eight volunteers based on their facial expressions and body gestures.