Top-Down System for Multi-Person 3D Absolute Pose Estimation from Monocular Videos

Amal El Kaid; Denis Brazey; Vincent Barra; Karim Baïna

doi:10.3390/s22114109

Sensors (Basel). 2022 May 28;22(11):4109. doi: 10.3390/s22114109.

Authors

Amal El Kaid¹ ² ³, Denis Brazey³, Vincent Barra¹, Karim Baïna²

Affiliations

¹ Université Clermont-Auvergne, CNRS, Mines de Saint-Étienne, Clermont-Auvergne-INP, LIMOS, 63000 Clermont-Ferrand, France.
² Alqualsadi Research Team, Rabat IT Center, ENSIAS, Mohammed V University in Rabat, Rabat 10112, Morocco.
³ Société Prynel, RD974, 21190 Corpeau, France.

Abstract

Two-dimensional (2D) multi-person pose estimation and three-dimensional (3D) root-relative pose estimation from a monocular RGB camera have made significant progress recently. Yet real-world applications require depth estimation and the ability to determine the distances between people in a scene. It is therefore necessary to recover the 3D absolute poses of several people, which remains a challenge when only a single camera viewpoint is available. Furthermore, previously proposed systems typically require a significant amount of resources and memory. To overcome these restrictions, we propose a real-time framework for multi-person 3D absolute pose estimation from a monocular camera, which integrates a human detector, a 2D pose estimator, a 3D root-relative pose reconstructor, and a root depth estimator in a top-down manner. The proposed system, called Root-GAST-Net, is based on modified versions of the GAST-Net and RootNet networks. Its efficiency is demonstrated through quantitative and qualitative evaluations on two benchmark datasets, Human3.6M and MuPoTS-3D. On all evaluated metrics, our results on the MuPoTS-3D dataset outperform the current state of the art by a significant margin, and the system runs in real time at 15 fps on an Nvidia GeForce GTX 1080.
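The final step of such a top-down pipeline, combining a root-relative 3D pose with an estimated root depth to obtain an absolute (camera-centric) pose, can be illustrated with a minimal sketch. It is not the Root-GAST-Net implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), and all function names and numeric values are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_root(u: float, v: float, depth: float,
                     fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift the root joint's pixel location (u, v) and its estimated depth
    to camera-centric coordinates, assuming a pinhole camera model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_absolute_pose(root_relative: List[Point3D], root_abs: Point3D) -> List[Point3D]:
    """Shift a root-relative pose (root joint at the origin) by the absolute root."""
    rx, ry, rz = root_abs
    return [(rx + x, ry + y, rz + z) for x, y, z in root_relative]


def root_distance(root_a: Point3D, root_b: Point3D) -> float:
    """Euclidean distance between two people's absolute root joints."""
    return math.dist(root_a, root_b)


# Hypothetical intrinsics and detections (pixels for u, v; millimetres for depth).
fx = fy = 1000.0
cx, cy = 960.0, 540.0

root_a = backproject_root(960.0, 540.0, 3000.0, fx, fy, cx, cy)
root_b = backproject_root(1160.0, 540.0, 3000.0, fx, fy, cx, cy)

# A two-joint toy pose: the root itself and one joint 500 mm above it
# (image y points down, so "above" is negative y).
pose_a = to_absolute_pose([(0.0, 0.0, 0.0), (0.0, -500.0, 0.0)], root_a)

print(root_a, root_b, root_distance(root_a, root_b))
```

Once every person's pose is expressed in the same camera-centric frame, distances between people reduce to ordinary Euclidean distances between joints, which is what makes the absolute root position necessary in addition to the root-relative pose.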

Keywords: 3D multi-person pose estimation; absolute poses; artificial intelligence; camera-centric coordinates; computer vision; deep-learning.

MeSH terms

Algorithms*
Humans
Imaging, Three-Dimensional* / methods
