View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition

Pengfei Zhang; Cuiling Lan; Junliang Xing; Wenjun Zeng; Jianru Xue; Nanning Zheng

doi:10.1109/TPAMI.2019.2896631

IEEE Trans Pattern Anal Mach Intell. 2019 Aug;41(8):1963-1978. doi: 10.1109/TPAMI.2019.2896631. Epub 2019 Jan 31.

Authors

Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, Nanning Zheng

PMID: 30714909
DOI: 10.1109/TPAMI.2019.2896631

Abstract

Skeleton-based human action recognition has recently attracted increasing attention thanks to the accessibility and popularity of 3D skeleton data. One of the key challenges in action recognition is the large variation in action representations when they are captured from different viewpoints. To alleviate the effects of view variations, this paper introduces a novel view adaptation scheme, which automatically determines the virtual observation viewpoints over the course of an action in a learning-based, data-driven manner. Instead of re-positioning the skeletons using a fixed, human-defined prior criterion, we design two view adaptive neural networks, VA-RNN and VA-CNN, which are built on the recurrent neural network (RNN) with Long Short-Term Memory (LSTM) and the convolutional neural network (CNN), respectively. For each network, a novel view adaptation module learns and determines the most suitable observation viewpoints and transforms the skeletons to those viewpoints for end-to-end recognition with a main classification network. Ablation studies show that the proposed view adaptive models are capable of transforming the skeletons of various views to much more consistent virtual viewpoints. The models therefore largely eliminate the influence of the viewpoints, enabling the networks to focus on learning action-specific features and resulting in superior performance. In addition, we design a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks to provide the final prediction, obtaining enhanced performance. Moreover, random rotation of skeleton sequences is employed during training to improve the robustness of the view adaptation models and alleviate overfitting. Extensive experimental evaluations on five challenging benchmarks demonstrate the effectiveness of the proposed view-adaptive networks and their superior performance over state-of-the-art approaches.
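
The geometric operations named in the abstract (transforming a skeleton to a new observation viewpoint, randomly rotating sequences for training, and fusing the scores of two networks) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the formulas, so the code assumes that a viewpoint change is a rigid transform p' = R(p - d) with R built from three Euler angles, that augmentation applies one random rotation per sequence, and that fusion is a weighted average of class scores. All function names and parameters (`rotation_matrix`, `change_view`, `random_rotate`, `fuse_scores`, `weight`) are hypothetical. In the paper's networks the viewpoint parameters would be predicted by the learned view adaptation module; here they are passed in by hand.

```python
import math
import random


def rotation_matrix(alpha, beta, gamma):
    """Return R = Rz(gamma) * Ry(beta) * Rx(alpha); angles in radians."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def change_view(frame, angles, shift):
    """Re-observe one frame of 3D joints (assumed form: p' = R (p - d))."""
    rot = rotation_matrix(*angles)
    transformed = []
    for joint in frame:
        centered = [joint[i] - shift[i] for i in range(3)]
        transformed.append(
            tuple(sum(rot[r][c] * centered[c] for c in range(3)) for r in range(3))
        )
    return transformed


def random_rotate(sequence, max_angle, rng=random):
    """Augmentation sketch: apply one random rotation to every frame of a sequence."""
    angles = tuple(rng.uniform(-max_angle, max_angle) for _ in range(3))
    return [change_view(frame, angles, (0.0, 0.0, 0.0)) for frame in sequence]


def fuse_scores(scores_rnn, scores_cnn, weight=0.5):
    """Two-stream fusion sketch: weighted average of per-class scores (assumed rule)."""
    fused = [weight * a + (1.0 - weight) * b for a, b in zip(scores_rnn, scores_cnn)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused
```

A rigid transform of this kind preserves the distances between joints, so only the observation viewpoint changes and the pose itself is left intact. That property is what makes it reasonable to learn the viewpoint parameters jointly with the classifier.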

Publication types

Research Support, Non-U.S. Gov't