A Deep Sequence Learning Framework for Action Recognition in Small-Scale Depth Video Dataset

Mohammad Farhad Bulbul; Amin Ullah; Hazrat Ali; Daijin Kim

doi:10.3390/s22186841

Sensors (Basel). 2022 Sep 9;22(18):6841. doi: 10.3390/s22186841.

Authors

Mohammad Farhad Bulbul¹,², Amin Ullah³, Hazrat Ali⁴, Daijin Kim¹

Affiliations

¹ Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), 77 Cheongam, Pohang 37673, Korea.
² Department of Mathematics, Jashore University of Science and Technology, Jashore 7408, Bangladesh.
³ CORIS Institute, Oregon State University, Corvallis, OR 97331, USA.
⁴ College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha P.O. Box 34110, Qatar.

Abstract

Depth video sequence-based deep models for recognizing human actions are scarce compared to models based on RGB and skeleton video sequences. This scarcity limits research progress on depth data, as training deep models with small-scale data is challenging. In this work, we propose a deep sequence classification model for depth video data in scenarios where the video data are limited. Unlike approaches that summarize the contents of each frame into a single class, our method directly classifies a depth video, i.e., a sequence of depth frames. First, the proposed system transforms an input depth video into three sequences of multi-view temporal motion frames. Together with these three temporal motion sequences, the input depth frame sequence provides a four-stream representation of the input depth action video. Next, the DenseNet121 architecture, initialized with ImageNet pre-trained weights, is employed to extract discriminative frame-level action features from the depth and temporal motion frames. The four resulting sets of frame-level feature vectors, one per stream, are fed into four bi-directional LSTM (BLSTM) networks. The temporal features are further analyzed through multi-head self-attention (MHSA) to capture multi-view sequence correlations. Finally, the concatenation of their outputs is processed through dense layers to classify the input depth video. Experimental results on two small-scale benchmark depth datasets, MSRAction3D and DHA, demonstrate that the proposed framework is effective even with limited training samples and outperforms existing depth data-based action recognition methods.
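
The data flow described in the abstract can be made concrete by tracing tensor shapes for one video through the four streams. The following is a minimal sketch, not the authors' implementation: `SketchConfig`, `trace_shapes`, and the values for the frame count, LSTM hidden size, and number of attention heads are hypothetical, and the sketch assumes that MHSA is applied per stream and that each stream is averaged over time before concatenation, neither of which the abstract specifies. The 1024-dimensional frame feature is the size of DenseNet121's globally pooled output.

```python
from dataclasses import dataclass

# Size of the globally pooled feature vector produced by DenseNet121.
DENSENET121_FEATURE_DIM = 1024


@dataclass
class SketchConfig:
    """Hypothetical hyperparameters; the abstract does not state these values."""
    num_frames: int = 32     # frames sampled per video (assumed)
    lstm_hidden: int = 128   # hidden units per LSTM direction (assumed)
    num_heads: int = 4       # attention heads (assumed)
    num_classes: int = 20    # MSRAction3D has 20 action classes
    num_streams: int = 4     # depth frames + three temporal motion sequences


def trace_shapes(cfg):
    """Return (stage, shape) pairs for one video passing through the pipeline."""
    # A bi-directional LSTM concatenates forward and backward hidden states.
    blstm_dim = 2 * cfg.lstm_hidden
    # Multi-head attention splits the feature width evenly across heads.
    if blstm_dim % cfg.num_heads != 0:
        raise ValueError("attention width must be divisible by the number of heads")
    return [
        ("frame features per stream", (cfg.num_frames, DENSENET121_FEATURE_DIM)),
        ("BLSTM output per stream", (cfg.num_frames, blstm_dim)),
        # Self-attention preserves the sequence length and feature width.
        ("MHSA output per stream", (cfg.num_frames, blstm_dim)),
        # Assumption: each stream is averaged over time before concatenation.
        ("pooled vector per stream", (blstm_dim,)),
        ("concatenated streams", (cfg.num_streams * blstm_dim,)),
        ("class scores", (cfg.num_classes,)),
    ]


if __name__ == "__main__":
    for stage, shape in trace_shapes(SketchConfig()):
        print(f"{stage}: {shape}")
```

Under these assumed settings, the dense layers receive one vector per video whose width is four times the BLSTM output width, so the classifier input grows linearly with the number of streams.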

Keywords: 3D action recognition; CNN; RNN; attention; bi-directional LSTM; depth map sequence; transfer learning.

MeSH terms

Databases, Factual
Human Activities*
Humans
Motion
Neural Networks, Computer*
Skeleton

Grants and funding

2017-0-00897 and 2018-0-01290/Institute of Information & communications Technology Planning & Evaluation, Korea government