Transformer-based CNNs: Mining Temporal Context Information for Multi-sound COVID-19 Diagnosis

Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:2335-2338. doi: 10.1109/EMBC46164.2021.9629552.

Abstract

Due to the COronaVIrus Disease 2019 (COVID-19) pandemic, early screening for COVID-19 is essential to prevent its transmission. Recent studies have shown that detecting COVID-19 with computer audition techniques has the potential to provide a fast, cheap, and ecologically friendly diagnosis. Respiratory sounds and speech may contain rich and complementary information about COVID-19 clinical conditions. We therefore propose training three deep neural networks on three types of sounds (breathing/counting/vowel) and assembling these models to improve performance. More specifically, we employ Convolutional Neural Networks (CNNs) to extract spatial representations from log Mel spectrograms, and a multi-head attention mechanism in the transformer to mine temporal context information from the CNNs' outputs. The experimental results demonstrate that the transformer-based CNNs can effectively detect COVID-19 on the DiCOVA Track-2 database (AUC: 70.0%), outperforming plain CNNs and hybrid CNN-RNN models.
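To make the core mechanism concrete, the sketch below shows how multi-head self-attention can mine temporal context from a sequence of per-frame CNN features. This is an illustrative NumPy implementation under stated assumptions, not the authors' code: the input `frames` array stands in for CNN outputs over log Mel spectrogram frames, and the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Multi-head self-attention over a (T, d) sequence of frame embeddings.

    x is assumed to be the CNN's per-frame output (one d-dim vector per
    log-Mel time frame); weights are random stand-ins for learned params.
    """
    T, d = x.shape
    assert d % num_heads == 0
    dh = d // num_heads
    # Random projections standing in for learned W_Q, W_K, W_V, W_O.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    # Split the d channels into num_heads heads: (H, T, dh).
    def split(M):
        return M.reshape(T, num_heads, dh).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention: each frame attends to every frame,
    # which is how temporal context across the recording is aggregated.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)  # (H, T, T)
    attn = softmax(scores, axis=-1)
    ctx = attn @ Vh                                    # (H, T, dh)
    # Re-concatenate heads and apply the output projection.
    out = ctx.transpose(1, 0, 2).reshape(T, d) @ Wo    # (T, d)
    return out, attn

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 64))  # 50 time frames, 64-dim CNN features
out, attn = multi_head_attention(frames, num_heads=4, rng=rng)
print(out.shape, attn.shape)
```

In a full model along the lines the abstract describes, the attended frame representations would then be pooled and fed to a classifier head producing the COVID-19 prediction; separate models per sound type (breathing/counting/vowel) would be combined at the decision level.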

MeSH terms

  • COVID-19 Testing
  • COVID-19*
  • Humans
  • Neural Networks, Computer
  • SARS-CoV-2