Three-stream Attention-aware Network for RGB-D Salient Object Detection

IEEE Trans Image Process. 2019 Jan 7. doi: 10.1109/TIP.2019.2891104. Online ahead of print.

Abstract

Previous RGB-D fusion systems based on convolutional neural networks (CNNs) typically employ a two-stream architecture, in which the RGB and depth inputs are learned independently. Multi-modal fusion is then performed by concatenating the deep features from the two streams during inference. This traditional two-stream architecture may suffer from insufficient multi-modal fusion due to the following two limitations: (1) cross-modal complementarity is rarely exploited in the bottom-up path, where we believe the cross-modal complements can be combined to learn new discriminative features that enrich the RGB-D representations; (2) the cross-modal channels are typically combined by undifferentiated concatenation, which offers no explicit mechanism for selecting complementary cross-modal features. In this work, we address these two limitations by proposing a novel three-stream attention-aware multi-modal fusion network. In the proposed architecture, a cross-modal distillation stream, accompanying the RGB-specific and depth-specific streams, is introduced to extract new RGB-D features at each level of the bottom-up path. Furthermore, a channel-wise attention mechanism is introduced to the cross-modal, cross-level fusion problem to adaptively select complementary feature maps from each modality at each level. Extensive experiments demonstrate the effectiveness of the proposed architecture and its significant improvement over state-of-the-art RGB-D salient object detection methods.
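
To illustrate the kind of channel-wise attention fusion the abstract describes, the following PyTorch sketch re-weights concatenated RGB and depth feature maps with a squeeze-and-excitation-style gate before a 1x1 convolution fuses them. This is a minimal sketch of the general technique, not the authors' published implementation; the module name ChannelAttentionFusion, the reduction ratio, and the fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttentionFusion(nn.Module):
    """Illustrative channel-wise attention over concatenated RGB and depth features.

    A globally average-pooled descriptor is passed through a small bottleneck MLP
    to produce per-channel weights, which re-scale the concatenated cross-modal
    channels before a 1x1 convolution fuses them into a single RGB-D feature map.
    (Sketch only; details of the paper's actual attention module may differ.)
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        fused = 2 * channels  # RGB channels + depth channels after concatenation
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                               # squeeze: global spatial context
            nn.Conv2d(fused, fused // reduction, kernel_size=1),   # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, kernel_size=1),   # excite
            nn.Sigmoid(),                                          # per-channel selection weights
        )
        self.fuse = nn.Conv2d(fused, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb_feat, depth_feat], dim=1)  # undifferentiated concatenation ...
        x = x * self.attention(x)                     # ... made selective by channel weights
        return self.fuse(x)                           # fused RGB-D feature map


if __name__ == "__main__":
    rgb = torch.randn(1, 64, 56, 56)
    depth = torch.randn(1, 64, 56, 56)
    fused = ChannelAttentionFusion(channels=64)(rgb, depth)
    print(fused.shape)  # torch.Size([1, 64, 56, 56])
```

Because the attention weights are computed from a global descriptor of the concatenated channels, they act as a per-channel selector over both modalities, which is the role the abstract assigns to the channel-wise attention mechanism in place of plain concatenation.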