CASNet: A Cross-Attention Siamese Network for Video Salient Object Detection

Yuzhu Ji; Haijun Zhang; Zequn Jie; Lin Ma; Q M Jonathan Wu

doi:10.1109/TNNLS.2020.3007534

IEEE Trans Neural Netw Learn Syst. 2021 Jun;32(6):2676-2690. doi: 10.1109/TNNLS.2020.3007534. Epub 2021 Jun 2.

PMID: 32692684

Abstract

Recent works on video salient object detection have demonstrated that directly transferring the generalization ability of image-based models to video data without modeling spatial-temporal information remains nontrivial and challenging. Considering both intraframe accuracy and interframe consistency of saliency detection, this article presents a novel cross-attention-based encoder-decoder model under the Siamese framework (CASNet) for video salient object detection. A baseline encoder-decoder model trained with the Lovász softmax loss function is adopted as the backbone network to guarantee the accuracy of intraframe salient object detection. Self- and cross-attention modules are incorporated into the model to preserve the saliency correlation and improve interframe saliency detection consistency. Extensive experimental results obtained by ablation analysis and cross-data set validation demonstrate the effectiveness of the proposed method. Quantitative results indicate that the CASNet model outperforms 19 state-of-the-art image- and video-based methods on six benchmark data sets.
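
The abstract does not specify how the cross-attention modules are built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each frame has already been encoded into a list of per-position feature vectors, uses plain scaled dot-product attention with no learned projections, and adds the attended features back residually. The function names (`cross_attend`, `siamese_cross_attention`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_feats, context_feats):
    """For each query position, mix the context features by attention weight.

    Both arguments are lists of equal-length feature vectors (lists of floats).
    Assumption: scaled dot-product affinity, no learned projection matrices.
    """
    dim = len(query_feats[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in query_feats:
        # Affinity between this query position and every context position.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in context_feats]
        weights = softmax(scores)
        # Weighted sum of context features, channel by channel.
        mixed = [sum(w * k[c] for w, k in zip(weights, context_feats))
                 for c in range(dim)]
        attended.append(mixed)
    return attended


def siamese_cross_attention(frame_a, frame_b):
    """Apply the same attention function in both directions.

    Each frame's features are refined with information from the other frame
    and added back residually (an assumption made for this sketch).
    """
    a_from_b = cross_attend(frame_a, frame_b)
    b_from_a = cross_attend(frame_b, frame_a)
    out_a = [[x + y for x, y in zip(f, g)] for f, g in zip(frame_a, a_from_b)]
    out_b = [[x + y for x, y in zip(f, g)] for f, g in zip(frame_b, b_from_a)]
    return out_a, out_b


if __name__ == "__main__":
    # Two toy "frames", each with three positions and two channels.
    frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    frame_b = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
    out_a, out_b = siamese_cross_attention(frame_a, frame_b)
    print(out_a)
    print(out_b)
```

Because the same function processes both directions, the two branches share their (here nonexistent) parameters, which is the sense in which the sketch is Siamese. A trainable version would add learned query, key, and value projections before the dot product.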