Energy-Guided Temporal Segmentation Network for Multimodal Human Action Recognition

Sensors (Basel). 2020 Aug 19;20(17):4673. doi: 10.3390/s20174673.

Abstract

To achieve satisfactory performance in human action recognition, a central task is to address the sub-action sharing problem, especially among similar action classes. Nevertheless, most existing convolutional neural network (CNN)-based action recognition algorithms uniformly divide a video into frames and then randomly select frames as inputs, ignoring the distinct characteristics of different frames. In recent years, depth videos have been increasingly used for action recognition, but most methods focus only on the spatial information of different actions without exploiting temporal information. To address these issues, a novel energy-guided temporal segmentation method is proposed here, and a multimodal fusion strategy is combined with the proposed segmentation method to construct an energy-guided temporal segmentation network (EGTSN). Specifically, the EGTSN has two parts: energy-guided video segmentation and a multimodal fusion heterogeneous CNN. The proposed solution was evaluated on the public large-scale NTU RGB+D dataset. Comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed network.
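The abstract contrasts uniform/random frame sampling with energy-guided segmentation. The sketch below illustrates one plausible reading of that idea, not the paper's actual algorithm: per-frame motion energy is estimated by frame differencing, the video is split into segments of roughly equal cumulative energy, and one representative frame is picked per segment. The energy definition and the highest-energy-frame selection rule are both assumptions for illustration.

```python
import numpy as np

def motion_energy(frames):
    """Per-frame motion energy as the mean absolute difference from the
    previous frame. `frames` has shape (T, H, W); the first frame gets
    zero energy. (This energy definition is an assumption, not the
    paper's stated formulation.)"""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    energy = diffs.mean(axis=(1, 2))
    return np.concatenate([[0.0], energy])  # shape (T,)

def energy_guided_frame_picks(frames, num_segments=3):
    """Split the video so each segment carries ~equal cumulative motion
    energy, then pick the highest-energy frame in each segment
    (a hypothetical selection rule)."""
    e = motion_energy(frames)
    cum = np.cumsum(e)
    total = cum[-1] if cum[-1] > 0 else 1.0
    # Boundary index of each equal-energy quantile.
    bounds = [int(np.searchsorted(cum, total * k / num_segments))
              for k in range(1, num_segments)]
    starts = [0] + bounds
    ends = bounds + [len(e)]
    picks = []
    for s, t in zip(starts, ends):
        t = max(t, s + 1)  # guard against empty segments
        picks.append(s + int(np.argmax(e[s:t])))
    return picks
```

Under this scheme, a static stretch of video contributes little cumulative energy and so gets fewer representative frames than a high-motion stretch, which is the intuition behind replacing uniform sampling.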

Keywords: heterogeneous convolutional neural networks; motion energy; multimodal action recognition; temporal segmentation network.

MeSH terms

  • Algorithms*
  • Human Activities
  • Humans
  • Neural Networks, Computer*