Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion

Baijun Xie; Mariia Sidulova; Chung Hyuk Park

doi:10.3390/s21144913

Sensors (Basel). 2021 Jul 19;21(14):4913. doi: 10.3390/s21144913.

Authors

Baijun Xie¹, Mariia Sidulova¹, Chung Hyuk Park¹

Affiliation

¹ Department of Biomedical Engineering, School of Engineering and Applied Science, George Washington University, Washington, DC 20052, USA.

Abstract

Decades of scientific research have been devoted to developing and evaluating methods for automated emotion recognition. As technology advances rapidly, a growing range of applications requires recognizing the emotional state of the user. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models, one each for the audio, video, and text modalities, are built and fine-tuned on the MELD dataset. A transformer-based crossmodality fusion combined with the EmbraceNet architecture is then used to estimate the emotion. The proposed multimodal network architecture achieves up to 65% accuracy, significantly surpassing each of the unimodal models. We apply multiple evaluation techniques to show that the model is robust and can even outperform state-of-the-art models on MELD.
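
To make the two fusion ideas named in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes that crossmodal attention means scaled dot-product attention with queries from one modality and keys/values from another, and that EmbraceNet-style fusion means picking, for each feature index, one modality at random according to given probabilities. Learned projection and docking layers are omitted, and all function names, shapes, and probabilities are hypothetical choices for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention across modalities.

    queries come from the target modality; keys and values come from the
    source modality. Learned projections are omitted (assumption).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the source values.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mean_pool(seq):
    """Average a sequence of feature vectors over time."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def embrace_fusion(features, probs, rng):
    """EmbraceNet-style fusion (sketch): for every feature index, sample one
    modality according to probs and copy that modality's value."""
    dim = len(features[0])
    choices = rng.choices(range(len(features)), weights=probs, k=dim)
    fused = [features[m][i] for i, m in enumerate(choices)]
    return fused, choices


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    # Hypothetical per-utterance sequences: 3 text tokens, 5 audio frames, 2 video frames.
    text = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
    audio = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
    video = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

    # Text attends to audio and to video; results are pooled to one vector each.
    text_audio = mean_pool(cross_attention(text, audio, audio))
    text_video = mean_pool(cross_attention(text, video, video))
    text_only = mean_pool(text)

    fused, choices = embrace_fusion([text_only, text_audio, text_video],
                                    [1 / 3, 1 / 3, 1 / 3], rng)
    print("modality chosen per feature index:", choices)
    print("fused vector:", fused)
```

Setting a modality's probability to zero removes it from the fused vector entirely, which is one way such a scheme can tolerate a missing or unreliable modality.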

Keywords: attention mechanism; crossmodal transformer; multimodal emotion recognition; multimodal fusion.

MeSH terms

Communication
Emotions*
Physical Therapy Modalities
Recognition, Psychology*

Grants and funding

1846658/National Science Foundation