Unaligned Multimodal Sequences for Depression Assessment From Speech

Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul;2022:3409-3413. doi: 10.1109/EMBC48229.2022.9871556.

Abstract

A growing area of mental health research concerns how an individual's degree of depression might be assessed automatically through the analysis of objective multimodal markers. When approached with machine learning, however, this task is challenging because the multimodal sequences are typically unaligned and the amount of annotated training data is limited. In this paper, a novel cross-modal framework for automatic depression severity assessment is proposed. Low-level descriptors (LLDs) are first extracted from multiple cues (text, audio, and video), after which multimodal fusion via a cross-modal attention mechanism is applied to learn more accurate feature representations. For the features extracted from each modality, the cross-modal attention mechanism continuously updates the input sequence of the target modality until an eight-item Patient Health Questionnaire (PHQ-8) score can be predicted. Moreover, a Self-Attention Generative Adversarial Network (SAGAN) is employed to increase the amount of training data available for depression severity analysis. Experimental results on the depression sub-challenge datasets of the Audio/Visual Emotion Challenges (AVEC 2017 and AVEC 2019) demonstrate the effectiveness of the proposed method.
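The fusion step described above can be illustrated with a minimal sketch in PyTorch, shown below. It treats text as the target modality whose sequence is repeatedly updated by attending to the audio and video streams, followed by pooling and a PHQ-8 regression head. All module names, feature dimensions, and the choice of target modality are illustrative assumptions, not the paper's actual implementation; the SAGAN-based augmentation is omitted.

```python
# Minimal sketch of cross-modal attention fusion for unaligned sequences,
# assuming PyTorch. Names, dimensions, and the regression head are
# hypothetical; they do not reproduce the paper's exact architecture.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Update a target-modality sequence by attending to a source modality.

    Queries come from the target stream; keys/values come from the source
    stream, so the two sequences need not be time-aligned or equal-length.
    """

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_tgt, d_model); source: (batch, T_src, d_model)
        updated, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + updated)  # residual connection


class CrossModalPHQRegressor(nn.Module):
    """Fuse text/audio/video LLD streams and regress a PHQ-8 score."""

    def __init__(self, dims=(300, 74, 35), d_model: int = 64):
        super().__init__()
        # Project each modality's LLDs into a shared embedding space.
        # The input dimensions above are placeholder values.
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        # One cross-modal block per source modality attending into the target.
        self.audio_to_text = CrossModalBlock(d_model)
        self.video_to_text = CrossModalBlock(d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, text, audio, video):
        t, a, v = (p(x) for p, x in zip(self.proj, (text, audio, video)))
        # Text is the target stream here; audio and video refine it in turn.
        t = self.audio_to_text(t, a)
        t = self.video_to_text(t, v)
        return self.head(t.mean(dim=1)).squeeze(-1)  # pooled score estimate


# Unaligned inputs: 50 text tokens, 400 audio frames, 120 video frames.
model = CrossModalPHQRegressor()
score = model(torch.randn(2, 50, 300),
              torch.randn(2, 400, 74),
              torch.randn(2, 120, 35))
print(score.shape)  # torch.Size([2])
```

Because the attention weights are computed between every target position and every source position, no explicit alignment or resampling of the three streams is required, which is the property that motivates cross-modal attention for unaligned multimodal sequences.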

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Depression / diagnosis
  • Emotions
  • Humans
  • Machine Learning
  • Speech*