Multimodal fall detection for solitary individuals based on audio-video decision fusion processing

Heliyon. 2024 Apr 16;10(8):e29596. doi: 10.1016/j.heliyon.2024.e29596. eCollection 2024 Apr 30.

Abstract

Falls often pose significant safety risks to solitary individuals, especially the elderly. Implementing a fast and efficient fall detection system is an effective strategy to address this hidden danger. We propose a multimodal method based on audio and video. While relying only on non-intrusive equipment, it reduces, to a certain extent, the false negatives that the most commonly used video-based methods can suffer under insufficient lighting, when the subject leaves the monitored area, and in similar conditions. Methods based on audio-video fusion are therefore expected to become a leading solution for fall detection in the foreseeable future. Specifically, this article outlines the following methodology: the video-based model uses YOLOv7-Pose to extract key skeleton joints, which are then fed into a two-stream Spatial Temporal Graph Convolutional Network (ST-GCN) for classification, while the audio-based model computes log-scaled mel spectrograms as acoustic features, which are processed by the MobileNetV2 architecture for detection. The two results are then fused at the decision level through linear weighting and Dempster-Shafer (D-S) evidence theory. In evaluation, our multimodal fall detection method significantly outperforms either single-modality method; in particular, sensitivity increases from 81.67% with the video modality alone to 96.67% (linear weighting) and 97.50% (D-S theory), underscoring the effectiveness of integrating video and audio data for more robust and reliable fall detection in complex and diverse daily-life environments.
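The abstract gives no implementation details, so the following is only a minimal sketch of the audio front end it describes, assuming a monaural recording loaded with librosa; the sample rate, frame, hop, and mel-band values are illustrative assumptions, not parameters from the paper.

import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    # All parameter values here are illustrative assumptions, not from the paper.
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log scaling: convert power to decibels relative to the peak value.
    return librosa.power_to_db(mel, ref=np.max)

Likewise, a minimal sketch of the two decision-fusion rules named above, assuming each branch outputs a scalar fall probability and treating those outputs as Bayesian mass functions over the binary frame {fall, no fall}; the equal weighting is a placeholder, not the paper's tuned value.

def linear_weighted_fusion(p_video, p_audio, w_video=0.5):
    # Weighted average of the two per-modality fall probabilities.
    return w_video * p_video + (1.0 - w_video) * p_audio

def dempster_shafer_fusion(p_video, p_audio):
    # Dempster's rule of combination on the binary frame {fall, no_fall}.
    m1 = (p_video, 1.0 - p_video)   # video masses: (fall, no_fall)
    m2 = (p_audio, 1.0 - p_audio)   # audio masses
    # Conflict K: total mass the two sources assign to contradictory hypotheses.
    K = m1[0] * m2[1] + m1[1] * m2[0]
    # Combined belief in "fall", renormalised by the non-conflicting mass.
    return (m1[0] * m2[0]) / (1.0 - K)

# Example: a hesitant video branch reinforced by a confident audio branch.
p_lin = linear_weighted_fusion(0.55, 0.90)   # -> 0.725
p_ds = dempster_shafer_fusion(0.55, 0.90)    # -> ~0.917

In this toy example, D-S combination rewards agreement between a hesitant video branch and a confident audio branch more strongly than simple averaging does, which is one way such evidence-based fusion can lift sensitivity over a single modality.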

Keywords: Audio-video fusion; Fall detection; Multimodal analysis; Solitary individuals.