Fully Cross-Attention Transformer for Guided Depth Super-Resolution

Sensors (Basel). 2023 Mar 2;23(5):2723. doi: 10.3390/s23052723.

Abstract

Modern depth sensors are often characterized by low spatial resolution, which hinders their use in real-world applications. However, in many scenarios the depth map is accompanied by a corresponding high-resolution color image. In light of this, learning-based methods have been extensively used for guided super-resolution of depth maps. A guided super-resolution scheme uses a corresponding high-resolution color image to infer high-resolution depth maps from low-resolution ones. Unfortunately, these methods still suffer from texture-copying artifacts due to improper guidance from the color images. Specifically, in most existing methods, guidance from the color image is achieved by a naive concatenation of color and depth features. In this paper, we propose a fully transformer-based network for depth map super-resolution. A cascaded transformer module extracts deep features from the low-resolution depth map. It incorporates a novel cross-attention mechanism that seamlessly and continuously injects guidance from the color image into the depth upsampling process. A window partitioning scheme keeps the attention complexity linear in image resolution, so the network can be applied to high-resolution images. Extensive experiments show that the proposed method for guided depth super-resolution outperforms other state-of-the-art methods.
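To make the mechanism concrete, below is a minimal PyTorch sketch of windowed cross-attention in the spirit described above: queries are drawn from the depth features, keys and values from the guiding color features, and attention is computed within non-overlapping windows so the cost grows linearly with image resolution (quadratically only in the window size). All names (WindowCrossAttention, window_size, head counts) are illustrative assumptions, not the authors' actual implementation.

    import torch
    import torch.nn as nn

    class WindowCrossAttention(nn.Module):
        """Cross-attention inside non-overlapping windows.

        Queries come from depth features; keys/values come from the
        color features, so the color image guides depth upsampling.
        """

        def __init__(self, dim: int, num_heads: int = 4, window_size: int = 8):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.window_size = window_size
            self.scale = (dim // num_heads) ** -0.5
            self.to_q = nn.Linear(dim, dim)       # queries from depth
            self.to_kv = nn.Linear(dim, 2 * dim)  # keys/values from color
            self.proj = nn.Linear(dim, dim)

        def _windows(self, x: torch.Tensor) -> torch.Tensor:
            # (B, H, W, C) -> (B * num_windows, ws*ws, C)
            B, H, W, C = x.shape
            ws = self.window_size
            x = x.view(B, H // ws, ws, W // ws, ws, C)
            return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        def forward(self, depth_feat: torch.Tensor,
                    color_feat: torch.Tensor) -> torch.Tensor:
            # Both inputs: (B, H, W, C) with H, W divisible by window_size.
            B, H, W, C = depth_feat.shape
            ws = self.window_size
            q = self._windows(self.to_q(depth_feat))
            k, v = self.to_kv(self._windows(color_feat)).chunk(2, dim=-1)

            def heads(t: torch.Tensor) -> torch.Tensor:
                # (n, L, C) -> (n, num_heads, L, C // num_heads)
                n, L, _ = t.shape
                return t.reshape(n, L, self.num_heads, -1).transpose(1, 2)

            q, k, v = heads(q), heads(k), heads(v)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            out = (attn.softmax(dim=-1) @ v).transpose(1, 2)
            out = self.proj(out.reshape(-1, ws * ws, C))
            # Merge windows back to (B, H, W, C).
            out = out.view(B, H // ws, W // ws, ws, ws, C)
            return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

Because each window attends only to the co-located window in the color features, the per-image cost is O(HW * ws^2) rather than O((HW)^2), which is what makes the scheme practical at high resolutions.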

Keywords: attention; deep learning; depth maps; multimodal; super-resolution; transformers.

Grants and funding

This work was supported by the PMRI (Peter Munk Research Institute), Technion.