Multimodal automatic assessment of acute pain through facial videos and heart rate signals utilizing transformer-based architectures

Stefanos Gkikas; Nikolaos S Tachos; Stelios Andreadis; Vasileios C Pezoulas; Dimitrios Zaridis; George Gkois; Anastasia Matonaki; Thanos G Stavropoulos; Dimitrios I Fotiadis

doi:10.3389/fpain.2024.1372814

Front Pain Res (Lausanne). 2024 Mar 27:5:1372814. doi: 10.3389/fpain.2024.1372814. eCollection 2024.

Authors

Stefanos Gkikas¹ ², Nikolaos S Tachos³ ⁴, Stelios Andreadis⁵, Vasileios C Pezoulas³, Dimitrios Zaridis³ ⁴, George Gkois³, Anastasia Matonaki⁵, Thanos G Stavropoulos⁵, Dimitrios I Fotiadis³ ⁴

Affiliations

¹ Computational BioMedicine Laboratory (CBML), Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH), Heraklion, Greece.
² Department of Electrical & Computer Engineering, Hellenic Mediterranean University, Heraklion, Greece.
³ Biomedical Research Institute, Foundation for Research and Technology - Hellas (FORTH), Ioannina, Greece.
⁴ Unit of Medical Technology and Intelligent Information Systems, Department of Materials Science and Engineering, University of Ioannina, Ioannina, Greece.
⁵ Pfizer Center for Digital Innovation, Thessaloniki, Greece.

Abstract

Accurate and objective pain evaluation is crucial in developing effective pain management protocols, aiming to alleviate distress and prevent patients from experiencing decreased functionality. A multimodal automatic assessment framework for acute pain utilizing video and heart rate signals is introduced in this study. The proposed framework comprises four pivotal modules: the Spatial Module, responsible for extracting embeddings from videos; the Heart Rate Encoder, tasked with mapping heart rate signals into a higher-dimensional space; the AugmNet, designed to create learning-based augmentations in the latent space; and the Temporal Module, which utilizes the extracted video and heart rate embeddings for the final assessment. The Spatial Module is pre-trained with a two-stage strategy: first, with a face recognition objective to learn universal facial features, and second, with an emotion recognition objective in a multitask learning approach, enabling the extraction of high-quality embeddings for automatic pain assessment. Experiments with the facial videos and heart rate extracted from electrocardiograms of the BioVid database, along with a direct comparison to 29 studies, demonstrate state-of-the-art performance in unimodal and multimodal settings while maintaining high efficiency. Within the multimodal context, 82.74% and 39.77% accuracy were achieved for the binary and multi-level pain classification tasks, respectively, utilizing 9.62 million parameters for the entire framework.
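The data flow through the four modules can be illustrated with a toy sketch. This is not the authors' implementation: the random linear projections, the mean pooling that stands in for the transformer-based Temporal Module, the fusion by concatenation, and every name and dimension (`PainPipelineSketch`, `frame_dim`, `embed_dim`, `num_classes`) are assumptions made only to show how video and heart rate embeddings might be combined into one prediction.

```python
import random


def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mean_pool(seq):
    """Average a sequence of equal-length vectors over time."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


class PainPipelineSketch:
    """Hypothetical stand-in for the four-module framework (all sizes assumed)."""

    def __init__(self, frame_dim=16, embed_dim=8, num_classes=2, seed=0):
        rng = random.Random(seed)
        # Spatial Module stand-in: one embedding per video frame.
        self.spatial_w = random_matrix(embed_dim, frame_dim, rng)
        # Heart Rate Encoder stand-in: scalar heart rate -> embed_dim vector.
        self.hr_w = random_matrix(embed_dim, 1, rng)
        # AugmNet stand-in: a learned perturbation applied in latent space.
        self.augm_w = random_matrix(embed_dim, embed_dim, rng)
        # Temporal Module stand-in: pooling followed by a linear head.
        self.head_w = random_matrix(num_classes, 2 * embed_dim, rng)

    def forward(self, frames, heart_rates, augment=False):
        video_emb = [linear(f, self.spatial_w) for f in frames]
        hr_emb = [linear([hr], self.hr_w) for hr in heart_rates]
        if augment:
            # Add a learned offset to each video embedding (assumed form).
            video_emb = [
                [v + d for v, d in zip(e, linear(e, self.augm_w))]
                for e in video_emb
            ]
        # Assumed fusion: pool each modality over time, then concatenate.
        fused = mean_pool(video_emb) + mean_pool(hr_emb)
        return linear(fused, self.head_w)  # one logit per pain class


if __name__ == "__main__":
    model = PainPipelineSketch(num_classes=2)
    frames = [[0.5] * 16 for _ in range(4)]  # four fake frame feature vectors
    heart_rates = [72.0, 73.5, 71.0, 70.2]   # fake heart rate values
    print(model.forward(frames, heart_rates, augment=True))
```

Setting `num_classes=2` mirrors the binary task; a larger value would correspond to the multi-level task. The `augment` flag is meant for training-time use only in this sketch.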

Keywords: ECG; data fusion; deep learning; pain recognition; vision transformer.

Grants and funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article.