HDPose: Post-Hierarchical Diffusion with Conditioning for 3D Human Pose Estimation

Sensors (Basel). 2024 Jan 26;24(3):829. doi: 10.3390/s24030829.

Abstract

Recently, monocular 3D human pose estimation (HPE) methods were used to accurately predict 3D pose by solving the ill-pose problem caused by 3D-2D projection. However, monocular 3D HPE still remains challenging owing to the inherent depth ambiguity and occlusions. To address this issue, previous studies have proposed diffusion model-based approaches (DDPM) that learn to reconstruct a correct 3D pose from a noisy initial 3D pose. In addition, these approaches use 2D keypoints or context encoders that encode spatial and temporal information to inform the model. However, they often fall short of achieving peak performance, or require an extended period to converge to the target pose. In this paper, we proposed HDPose, which can converge rapidly and predict 3D poses accurately. Our approach aggregated spatial and temporal information from the condition into a denoising model in a hierarchical structure. We observed that the post-hierarchical structure achieved the best performance among various condition structures. Further, we evaluated our model on the widely used Human3.6M and MPI-INF-3DHP datasets. The proposed model demonstrated competitive performance with state-of-the-art models, achieving high accuracy with faster convergence while being considerably more lightweight.

Keywords: 3D human pose estimation; diffusion; hierarchical structure; transformer.

MeSH terms

  • Algorithms*
  • Humans
  • Imaging, Three-Dimensional* / methods