eGAC3D: enhancing depth adaptive convolution and depth estimation for monocular 3D object pose detection

PeerJ Comput Sci. 2022 Nov 3;8:e1144. doi: 10.7717/peerj-cs.1144. eCollection 2022.

Abstract

Many alternative approaches to 3D object detection with a single camera have been studied to avoid the prohibitive cost of high-precision 3D LiDAR sensors. Recently, we proposed GAC3D, a novel approach for 3D object detection that employs a ground plane model with geometric constraints to refine the results of a deep-learning-based detector. GAC3D adopts a depth adaptive convolution in place of the traditional 2D convolution to handle the divergent context of image features, yielding a significant improvement in both training convergence and testing accuracy on the KITTI 3D object detection benchmark. This article presents an alternative architecture named eGAC3D that adopts a revised depth adaptive convolution with variant guidance to improve detection accuracy. Additionally, eGAC3D utilizes pixel adaptive convolution to let the depth map guide the detection heads instead of relying on an external depth estimator as other methods do, which significantly reduces inference time. Experimental results on the KITTI benchmark show that eGAC3D outperforms not only our previous GAC3D but also many existing monocular methods in both accuracy and inference time. Moreover, we deployed and optimized the proposed eGAC3D framework on an embedded platform with a low-cost GPU. To the best of the authors' knowledge, we are the first to develop a monocular 3D detection framework on embedded devices. Experimental results on the Jetson Xavier NX demonstrate that the proposed method achieves near real-time performance with adequate accuracy even on modest hardware resources.
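To make the guidance idea concrete, the sketch below is a minimal, self-contained PyTorch illustration of pixel adaptive convolution in the spirit of Su et al. (CVPR 2019), using a depth map as the guidance signal: shared convolution weights are modulated per pixel by a Gaussian kernel over guidance differences. The function name, tensor shapes, and the fixed Gaussian kernel are illustrative assumptions for exposition, not the authors' eGAC3D implementation.

    import torch
    import torch.nn.functional as F

    def pixel_adaptive_conv(x, guide, weight, kernel_size=3):
        """Illustrative pixel-adaptive convolution: a standard convolution
        whose spatially shared weights are modulated, per pixel, by a
        Gaussian kernel computed from a guidance feature (here, a depth map).

        x:      (B, C_in, H, W)      input features
        guide:  (B, C_g, H, W)       guidance features (e.g., estimated depth)
        weight: (C_out, C_in, k, k)  conventional convolution weights
        """
        B, C_in, H, W = x.shape
        k = kernel_size
        pad = k // 2

        # Unfold input and guidance into k*k neighborhoods per pixel.
        x_unf = F.unfold(x, k, padding=pad).view(B, C_in, k * k, H, W)
        g_unf = F.unfold(guide, k, padding=pad).view(B, guide.size(1), k * k, H, W)

        # Gaussian adaptation kernel on guidance differences:
        # K(g_i, g_j) = exp(-0.5 * ||g_i - g_j||^2)
        diff = g_unf - guide.unsqueeze(2)             # (B, C_g, k*k, H, W)
        adapt = torch.exp(-0.5 * (diff ** 2).sum(1))  # (B, k*k, H, W)

        # Modulate neighborhood features, then apply the shared weights.
        x_adapt = x_unf * adapt.unsqueeze(1)          # (B, C_in, k*k, H, W)
        w = weight.view(weight.size(0), -1)           # (C_out, C_in*k*k)
        out = torch.einsum('oc,bchw->bohw',
                           w, x_adapt.reshape(B, C_in * k * k, H, W))
        return out

    # Example: 64-channel features guided by a 1-channel depth map.
    x = torch.randn(2, 64, 32, 96)
    depth = torch.randn(2, 1, 32, 96)
    w = torch.randn(128, 64, 3, 3)
    y = pixel_adaptive_conv(x, depth, w)  # -> (2, 128, 32, 96)

Because the adaptation kernel depends only on an internally estimated depth map, a detector built this way can condition its heads on depth without invoking a separate external depth network at inference, which is the source of the runtime savings the abstract describes.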

Keywords: 3D object pose detection; Adaptive convolution; Depth estimation.

Grants and funding

This research is funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2021-20-02. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.