Unsupervised Monocular Depth Estimation With Channel and Spatial Attention

IEEE Trans Neural Netw Learn Syst. 2022 Dec 2:PP. doi: 10.1109/TNNLS.2022.3221416. Online ahead of print.

Abstract

Understanding 3-D scene geometry from videos is a fundamental topic in visual perception. In this article, we propose an unsupervised monocular depth and camera motion estimation framework that learns from unlabeled monocular videos, overcoming the difficulty of acquiring per-pixel ground-truth depth at scale. The photometric loss, computed by warping nearby views to the target view using the estimated depth and pose, couples the depth and pose networks together and is essential to the unsupervised method. We introduce a channelwise attention mechanism to exploit the inter-channel relationships of features and a spatialwise attention mechanism to exploit their intra-spatial relationships. Applied in the depth network, both mechanisms better activate feature information between different convolutional layers and extract more discriminative features. In addition, we apply the Sobel operator to our edge-aware smoothness term, yielding more reasonable accuracy and clearer boundaries and structures. Together, these components help close the gap with fully supervised methods, achieving state-of-the-art results on the KITTI benchmark and strong generalization performance on the Make3D dataset.
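
The abstract does not give the exact formulation of the photometric loss, so the following is a minimal PyTorch sketch of the standard view-synthesis pipeline it describes: back-project target pixels with the predicted depth, transform them with the predicted target-to-source pose, reproject into the source view, and sample it to reconstruct the target. All names (`warp_source_to_target`, `photometric_loss`) and the plain L1 error are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, depth, K, K_inv, T):
    """Inverse-warp a source view into the target frame.
    src_img: (B,3,H,W) source image; depth: (B,1,H,W) predicted target depth;
    K, K_inv: (B,3,3) intrinsics and inverse; T: (B,4,4) target->source pose."""
    B, _, H, W = depth.shape
    # Homogeneous pixel grid of the target view.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    # Back-project to 3-D camera points, then move them into the source frame.
    cam = (K_inv @ pix) * depth.reshape(B, 1, -1)            # (B,3,HW)
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)
    src_cam = (T @ cam_h)[:, :3]
    # Project into source pixel coordinates (perspective divide).
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    # Normalize to [-1, 1] and sample the source image.
    x = src_pix[:, 0].reshape(B, H, W)
    y = src_pix[:, 1].reshape(B, H, W)
    grid = torch.stack([2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1], dim=-1)
    return F.grid_sample(src_img, grid, padding_mode="border",
                         align_corners=True)

def photometric_loss(tgt_img, warped):
    # Simple L1 photometric error; published methods often blend in SSIM.
    return (tgt_img - warped).abs().mean()
```

Because the warped reconstruction depends on both the depth map and the pose, minimizing this single loss trains the two networks jointly, which is the coupling the abstract refers to.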
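The abstract also does not specify the structure of the two attention modules. A common way to realize channelwise and spatialwise attention is a CBAM-style pair of gates, sketched below under that assumption; the class names, pooling choices, and reduction ratio are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channelwise attention: reweight each feature channel using a gate
    computed from globally pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled descriptor
        gate = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * gate

class SpatialAttention(nn.Module):
    """Spatialwise attention: reweight each location using a gate computed
    from channel-pooled maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate
```

Inserting such gates between convolutional stages of the depth encoder-decoder lets the network emphasize informative channels and image regions, which is the role the abstract assigns to the two mechanisms.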
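Finally, a sketch of an edge-aware smoothness term using Sobel gradients, the idea named in the abstract: disparity gradients are penalized less where the image has strong Sobel boundaries, so predicted depth edges align with image edges. The exponential weighting follows the common edge-aware formulation and is an assumption, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def sobel_grad(img):
    """Per-channel Sobel gradients; returns |Gx| and |Gy| averaged over channels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)  # Sobel kernel for the vertical direction
    c = img.shape[1]
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return gx.abs().mean(1, keepdim=True), gy.abs().mean(1, keepdim=True)

def edge_aware_smoothness(disp, img):
    """Penalize disparity gradients, down-weighted at strong image boundaries."""
    dx, dy = sobel_grad(disp)
    ix, iy = sobel_grad(img)
    return (dx * torch.exp(-ix) + dy * torch.exp(-iy)).mean()
```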