Fine-Tuned Temporal Dense Sampling with 1D Convolutional Neural Network for Human Action Recognition

Sensors (Basel). 2023 Jun 2;23(11):5276. doi: 10.3390/s23115276.

Abstract

Human action recognition is a constantly evolving field driven by numerous applications. In recent years, advanced representation learning techniques have yielded significant progress in this area. Despite this progress, human action recognition still poses significant challenges, particularly due to the unpredictable variations in the visual appearance of an image sequence. To address these challenges, we propose fine-tuned temporal dense sampling with a 1D convolutional neural network (FTDS-1DConvNet). Our method combines temporal segmentation with temporal dense sampling to capture the most salient features of a human action video. First, the human action video is partitioned into segments through temporal segmentation. Each segment is then processed by a fine-tuned Inception-ResNet-V2 model, and max pooling is performed along the temporal axis to encode the most significant features as a fixed-length representation. This representation is then fed into a 1DConvNet for further representation learning and classification. Experiments on UCF101 and HMDB51 demonstrate that the proposed FTDS-1DConvNet outperforms state-of-the-art methods, with classification accuracies of 88.43% on UCF101 and 56.23% on HMDB51.
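The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: random 8-dimensional vectors stand in for the per-frame Inception-ResNet-V2 features (which are in reality much higher-dimensional), the segment count and filter shapes are arbitrary, and the 1D convolution is a bare "valid" convolution over the segment axis without the learned weights, nonlinearities, or classifier head of the actual 1DConvNet.

```python
import numpy as np

def temporal_dense_sampling(frame_features, num_segments=3):
    """Partition per-frame features (T, D) into num_segments temporal
    segments and max-pool each segment along the temporal axis,
    yielding a fixed-length (num_segments, D) representation."""
    segments = np.array_split(frame_features, num_segments, axis=0)
    return np.stack([seg.max(axis=0) for seg in segments])

def conv1d(x, kernels):
    """Minimal 'valid' 1D convolution over the segment axis.
    x: (S, D) segment representation; kernels: (F, W, D) filters.
    Returns (S - W + 1, F) feature maps."""
    S, D = x.shape
    F, W, _ = kernels.shape
    out = np.empty((S - W + 1, F))
    for t in range(S - W + 1):
        window = x[t:t + W]  # (W, D) temporal window
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

# Toy example: 30 frames of 8-dim features standing in for CNN outputs.
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 8))
rep = temporal_dense_sampling(features, num_segments=3)   # shape (3, 8)
maps = conv1d(rep, rng.normal(size=(4, 2, 8)))            # shape (2, 4)
```

Note how the max pooling step makes the representation length depend only on the number of segments, not on the video's frame count, which is what allows videos of varying duration to feed a fixed-size 1DConvNet.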

Keywords: 1D convolutional neural network (1D ConvNet); 1D-CNN; Inception-ResNet-V2; human action recognition; temporal dense sampling.

MeSH terms

  • Human Activities
  • Humans
  • Image Processing, Computer-Assisted* / methods
  • Neural Networks, Computer
  • Pattern Recognition, Automated* / methods