Multimodal fusion diagnosis of depression and anxiety based on CNN-LSTM model

Comput Med Imaging Graph. 2022 Dec:102:102128. doi: 10.1016/j.compmedimag.2022.102128. Epub 2022 Oct 4.

Abstract

Background: In recent years, an increasing number of people have suffered from depression and anxiety. These conditions are difficult to detect and can be very dangerous. Currently, the Self-Rating Anxiety Scale (SAS) and the Self-Rating Depression Scale (SDS) are commonly used for initial screening for depression and anxiety disorders. However, the information contained in these two scales is limited, while subjects' symptoms are varied and complex, resulting in inconsistencies between questionnaire evaluation results and clinicians' diagnoses. To mine the scale data more fully, we propose a method that extracts facial expression and movement features from video recorded while subjects fill in the scales. We then combine the facial expression, movement, and scale information to establish a multimodal framework that improves the accuracy and robustness of the diagnosis of depression and anxiety.

Methods: We collect each subject's scale results together with video recorded while the scales are filled in. Given the two scales, SAS and SDS, we construct a model with two branches, each processing the multimodal data of SAS and SDS, respectively. Within a branch, we first build a convolutional neural network (CNN) that extracts facial expression features from each video frame. Second, we establish a long short-term memory (LSTM) network that further embeds the facial expression features and builds connections between frames, so that a movement feature for the video can be generated. Third, we transform the scale scores into one-hot format and feed them into the corresponding branch of the network to further mine the information in the multimodal data. Finally, we fuse the embeddings of the two branches to generate inference results for depression and anxiety.
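The two-branch pipeline described above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' implementation: all layer sizes, the number of scale items and answer choices, and the `ScaleBranch`/`FusionModel` names are assumptions made for the example. Each branch applies a per-frame CNN, an LSTM over the frame sequence to capture movement, and a linear embedding of the one-hot scale scores; the two branch embeddings are then concatenated and classified.

```python
import torch
import torch.nn as nn

class ScaleBranch(nn.Module):
    """One branch (SAS or SDS): per-frame CNN -> LSTM over frames,
    fused with a one-hot encoding of the scale scores.
    All dimensions here are illustrative assumptions."""
    def __init__(self, n_items=20, n_choices=4, embed_dim=64):
        super().__init__()
        # Frame-level CNN: extracts a facial-expression feature per frame
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch*frames, 32)
        )
        # LSTM links frames so a movement feature emerges from the sequence
        self.lstm = nn.LSTM(32, embed_dim, batch_first=True)
        # One-hot scale scores: n_items questions x n_choices options
        self.scale_fc = nn.Linear(n_items * n_choices, embed_dim)

    def forward(self, frames, scale_onehot):
        # frames: (batch, time, 3, H, W); scale_onehot: (batch, n_items*n_choices)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)   # last hidden state = movement embedding
        return torch.cat([h[-1], self.scale_fc(scale_onehot)], dim=1)

class FusionModel(nn.Module):
    """Fuses the SAS and SDS branch embeddings into a joint diagnosis."""
    def __init__(self, embed_dim=64, n_classes=2):
        super().__init__()
        self.sas = ScaleBranch(embed_dim=embed_dim)
        self.sds = ScaleBranch(embed_dim=embed_dim)
        # Each branch emits 2*embed_dim features (movement + scale embedding)
        self.head = nn.Linear(4 * embed_dim, n_classes)

    def forward(self, sas_frames, sas_scale, sds_frames, sds_scale):
        z = torch.cat([self.sas(sas_frames, sas_scale),
                       self.sds(sds_frames, sds_scale)], dim=1)
        return self.head(z)

model = FusionModel()
frames = torch.randn(2, 8, 3, 64, 64)   # 2 subjects, 8 frames each
scores = torch.zeros(2, 80)             # one-hot scale answers (20 items x 4 choices)
logits = model(frames, scores, frames, scores)
```

A late-fusion design like this keeps the video and questionnaire modalities separable per scale, so either branch can be ablated or retrained independently; the paper's actual fusion details (dimensions, class targets, loss) are not specified in the abstract.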

Results and conclusions: Building on the SAS and SDS score results, our multimodal model further mines the video information and reaches an accuracy of 0.946 in diagnosing depression and anxiety. This study demonstrates the feasibility of using our CNN-LSTM-based multimodal model for initial screening and diagnosis of depression and anxiety disorders with high diagnostic performance.

Keywords: Anxiety; Convolutional neural network; Depression; Long short-term memory; Multimodal fusion.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Anxiety / diagnosis
  • Anxiety Disorders
  • Depression* / diagnosis
  • Humans
  • Neural Networks, Computer*
  • Surveys and Questionnaires