Attention guided learnable time-domain filterbanks for speech depression detection

Wenju Yang; Jiankang Liu; Peng Cao; Rongxin Zhu; Yang Wang; Jian K Liu; Fei Wang; Xizhe Zhang

doi:10.1016/j.neunet.2023.05.041

Neural Netw. 2023 Aug:165:135-149. doi: 10.1016/j.neunet.2023.05.041. Epub 2023 May 26.

Authors

Wenju Yang¹, Jiankang Liu¹, Peng Cao², Rongxin Zhu³, Yang Wang³, Jian K Liu⁴, Fei Wang⁵, Xizhe Zhang⁶

Affiliations

¹ College of Computer Science and Engineering, Northeastern University, Shenyang, 110819, Liaoning, China; Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, Shenyang, 110819, Liaoning, China.
² College of Computer Science and Engineering, Northeastern University, Shenyang, 110819, Liaoning, China; Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, Shenyang, 110819, Liaoning, China. Electronic address: caopeng@cse.neu.edu.cn.
³ Early Intervention Unit, Department of Psychiatry, Affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing, 210096, China.
⁴ School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom.
⁵ Early Intervention Unit, Department of Psychiatry, Affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing, 210096, China. Electronic address: fei.wang@yale.edu.
⁶ School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, 211166, China. Electronic address: zhangxizhe@njmu.edu.cn.

PMID: 37285730

Abstract

Depression is a global mental health problem that lacks effective screening methods for early detection and treatment. This paper aims to facilitate large-scale depression screening by focusing on the speech depression detection (SDD) task. Currently, modeling the raw signal directly requires a large number of parameters, and existing deep learning-based SDD models mainly take fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and their manual settings limit the exploration of fine-grained feature representations. In this paper, we learn effective representations of the raw signal from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which combines a depression filterbanks features learning (DFBL) module with a multi-scale spectral attention learning (MSSA) module. DFBL produces biologically meaningful acoustic features by employing learnable time-domain filters, and MSSA guides the learnable filters to better retain the useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate research in depression analysis, and we evaluate DALF on NRAC and the public DAIC-WOZ dataset. The experimental results show that our method outperforms state-of-the-art SDD methods, with an F1 score of 78.4% on the DAIC-WOZ dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600-700 Hz, which corresponds to the Mandarin vowels /e/ and /ê/ and can be considered an effective biomarker for the SDD task. Taken together, our DALF model provides a promising approach to depression detection.
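
To make the idea of a time-domain filterbank concrete, the following is a minimal illustrative sketch, not the authors' DFBL implementation. The abstract does not state how the filters are parameterized, so the Gabor (Gaussian-windowed cosine) form, the function names `gabor_filterbank` and `band_energies`, the 16 kHz sample rate, and the chosen center frequencies are all assumptions made for illustration. In a learnable version, the center frequencies and bandwidths would be trainable parameters updated by gradient descent; here they are fixed, and the sketch only shows that a filter centered in the 600-700 Hz range responds most strongly to a 650 Hz tone.

```python
import numpy as np

def gabor_filterbank(center_hz, bandwidth_hz, sample_rate=16000, length=401):
    """Build real Gabor band-pass impulse responses, one row per filter.

    Assumed parameterization (not taken from the paper): a Gaussian
    envelope whose frequency-domain standard deviation is bandwidth_hz,
    modulated by a cosine at center_hz.
    """
    # Time axis in seconds, centred on zero.
    t = (np.arange(length) - length // 2) / sample_rate
    center = np.asarray(center_hz, dtype=float)[:, None]
    # Temporal standard deviation matching the requested bandwidth.
    sigma = 1.0 / (2.0 * np.pi * np.asarray(bandwidth_hz, dtype=float)[:, None])
    envelope = np.exp(-0.5 * (t / sigma) ** 2)
    kernels = envelope * np.cos(2.0 * np.pi * center * t)
    # Normalise every filter to unit L2 norm so band energies are comparable.
    return kernels / np.linalg.norm(kernels, axis=1, keepdims=True)

def band_energies(signal, kernels):
    """Filter the raw signal with every kernel; return mean output power per band."""
    outputs = np.stack([np.convolve(signal, k, mode="same") for k in kernels])
    return (outputs ** 2).mean(axis=1)

# Hypothetical test input: one second of a 650 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 650.0 * t)

# Three illustrative filters; only the middle one covers 600-700 Hz.
centers = np.array([200.0, 650.0, 2000.0])
widths = np.array([100.0, 100.0, 100.0])
kernels = gabor_filterbank(centers, widths, sample_rate)
energies = band_energies(tone, kernels)
strongest = centers[np.argmax(energies)]
print("band energies:", energies)
print("strongest band center (Hz):", strongest)
```

Because each filter is described by only a center frequency and a bandwidth, such a bank needs far fewer parameters than unconstrained convolution kernels on the raw waveform, and the learned values can be read directly as frequency sub-bands.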

Keywords: Affective computing; Filterbanks; Interpretability; Speech depression detection; Time–frequency analysis.

MeSH terms

Acoustics
Depression* / diagnosis
Methyl Parathion*
Speech

Substances

Methyl Parathion