Multi-modal depression detection based on emotional audio and evaluation text

J Affect Disord. 2021 Dec 1:295:904-913. doi: 10.1016/j.jad.2021.08.090. Epub 2021 Sep 2.

Abstract

Background: Early detection of depression is important for effective treatment. Given the inefficiency of current screening methods, automatic depression identification is a challenging research problem with clear practical value.

Methods: We propose a new experimental method for depression detection based on audio and text. 160 Chinese subjects were investigated in this study. Notably, we designed a text-reading experiment to induce rapid emotional changes in subjects, referred to below as the Segmental Emotional Speech Experiment (SESE). We extract 384-dimensional low-level audio features to examine how they differ across the emotional changes elicited in SESE. In addition, we propose a multi-modal fusion method based on DeepSpectrum features and word-vector features that uses deep learning to detect depression.
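
The abstract does not describe the fusion architecture, so the following is only a minimal illustrative sketch of late fusion by feature concatenation, written in PyTorch. The feature dimensions, hidden size, and classifier head are assumptions for illustration, not the authors' reported model.

    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        """Late-fusion sketch: concatenate an audio feature vector and a
        text feature vector, then classify with a small MLP. Dimensions
        are placeholders, not values reported in the paper."""

        def __init__(self, audio_dim=4096, text_dim=300, hidden_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(audio_dim + text_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.5),
                nn.Linear(hidden_dim, 2),  # case vs. control
            )

        def forward(self, audio_feat, text_feat):
            # audio_feat: (batch, audio_dim), e.g. DeepSpectrum-style embeddings
            # text_feat:  (batch, text_dim), e.g. averaged word vectors
            fused = torch.cat([audio_feat, text_feat], dim=1)
            return self.net(fused)

    # Toy forward pass with random features, just to show the shapes.
    model = FusionClassifier()
    audio = torch.randn(8, 4096)
    text = torch.randn(8, 300)
    logits = model(audio, text)  # shape: (8, 2)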

Results: Our experiments show that SESE improves the accuracy of depression recognition, and we found differences in the low-level audio features. The results were verified across the case and control groups and across gender and age groupings. The multi-modal fusion model also achieves a satisfactory accuracy of 0.912 and an F1 score of 0.906.
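
For reference, accuracy is the fraction of correct predictions and the F1 score is the harmonic mean of precision and recall. The snippet below is a minimal sketch of how these metrics are typically computed with scikit-learn; the labels are synthetic placeholders, not the study's data.

    from sklearn.metrics import accuracy_score, f1_score

    # Synthetic ground-truth and predicted labels (1 = case, 0 = control).
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

    print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
    print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall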

Conclusions: Our contribution is twofold. First, we propose and validate SESE, which offers a new experimental paradigm for future researchers. Second, we propose a new and efficient multi-modal depression recognition model.

Keywords: Artificial intelligence; Deep learning; Depression; Multi-modality.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Depression* / diagnosis
  • Emotions
  • Humans
  • Speech*