Skeleton-Based Spatio-Temporal U-Network for 3D Human Pose Estimation in Video

Sensors (Basel). 2022 Mar 28;22(7):2573. doi: 10.3390/s22072573.

Abstract

Despite great progress in 3D pose estimation from video, effective means of extracting spatio-temporal features of different granularity from complex dynamic skeleton sequences are still lacking. To tackle this problem, we propose a novel skeleton-based spatio-temporal U-Net (STUNet) scheme that handles spatio-temporal features at multiple scales for 3D human pose estimation in video. The proposed STUNet architecture consists of a cascade of semantic graph convolution layers and structural temporal dilated convolution layers, which progressively extract and fuse spatio-temporal semantic features from fine-grained to coarse-grained. This U-shaped network achieves scale compression and feature squeezing by downscaling and upscaling, while abstracting multi-resolution spatio-temporal dependencies through skip connections. Experiments demonstrate that our model effectively captures comprehensive spatio-temporal features at multiple scales and achieves substantial improvements over mainstream methods on real-world datasets.
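
The U-shaped flow described in the abstract (spatial graph convolution, temporal downscaling, temporal upscaling, skip connection) can be illustrated with a toy sketch. This is not the STUNet implementation: the 5-joint chain skeleton, the unweighted neighbour averaging that stands in for a semantic graph convolution, the stride-2 frame averaging that stands in for a temporal dilated convolution, and all function names are assumptions made purely for illustration.

```python
# Toy 5-joint chain skeleton (hypothetical; not the joint layout used in the paper).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
NUM_JOINTS = 5


def normalized_adjacency(edges, num_joints):
    """Row-normalized adjacency matrix with self-loops."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return [[v / sum(row) for v in row] for row in adj]


def graph_conv(seq, adj):
    """Spatial step: each joint becomes the weighted mean of itself and its
    neighbours. No learned weights (assumption made to keep the sketch small)."""
    out = []
    for frame in seq:
        channels = len(frame[0])
        out.append([
            [sum(adj[i][j] * frame[j][c] for j in range(len(frame)))
             for c in range(channels)]
            for i in range(len(frame))
        ])
    return out


def temporal_down(seq):
    """Temporal downscaling: average each pair of consecutive frames (stride 2)."""
    return [
        [[(a + b) / 2.0 for a, b in zip(ja, jb)]
         for ja, jb in zip(seq[t], seq[t + 1])]
        for t in range(0, len(seq) - 1, 2)
    ]


def temporal_up(seq, length):
    """Temporal upscaling: nearest-neighbour repetition back to `length` frames."""
    return [seq[min(t // 2, len(seq) - 1)] for t in range(length)]


def u_pass(seq, adj):
    """One down/up pass with a skip connection from the fine scale."""
    fine = graph_conv(seq, adj)                    # fine-grained features
    coarse = graph_conv(temporal_down(fine), adj)  # coarse-grained features
    restored = temporal_up(coarse, len(fine))      # back to the input length
    # Skip connection: fuse fine and restored coarse features by addition.
    return [
        [[f + r for f, r in zip(jf, jr)] for jf, jr in zip(ff, fr)]
        for ff, fr in zip(fine, restored)
    ]
```

Calling `u_pass(seq, normalized_adjacency(EDGES, NUM_JOINTS))` on a sequence shaped frames x joints x channels returns a sequence of the same shape. A real model would replace the fixed averaging with learned weights and stack several such levels.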

Keywords: 3D pose estimation; graph convolutional networks; non-local mechanics; temporal convolutional networks.

MeSH terms

  • Data Compression*
  • Humans
  • Neural Networks, Computer*
  • Pressure
  • Semantics
  • Skeleton