Rectify ViT Shortcut Learning by Visual Saliency

IEEE Trans Neural Netw Learn Syst. 2023 Sep 13:PP. doi: 10.1109/TNNLS.2023.3310531. Online ahead of print.

Abstract

Shortcut learning occurs when a deep learning model prioritizes unintended features, resulting in degraded feature representations and reduced generalizability and interpretability. However, shortcut learning in the widely used vision transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts, which are predominantly driven by background-related factors. For example, eye-gaze data from radiologists constitute effective human visual prior knowledge with great potential to guide deep learning models toward meaningful foreground regions. However, obtaining eye-gaze data can be time-consuming, labor-intensive, and sometimes impractical. In this work, we propose a novel and effective saliency-guided ViT (SGT) model to rectify shortcut learning in ViT in the absence of eye-gaze data. Specifically, a computational visual saliency model (either pretrained or fine-tuned) is adopted to predict saliency maps for input image samples. The saliency maps are then used to select the most informative image patches. Because this filtering operation may discard global information, we further introduce a residual connection that computes self-attention across all image patches. Experimental results on natural and medical image datasets show that our SGT framework effectively learns and leverages human prior knowledge without eye-gaze data and achieves much better performance than the baselines. It also rectifies harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of visual saliency derived from human prior knowledge for rectifying shortcut learning.
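
To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of saliency-guided patch selection combined with a residual global self-attention branch. It is not the authors' implementation: the class name SaliencyGuidedBlock, the keep_ratio parameter, and the assumption that per-patch saliency scores are precomputed by an external saliency model are illustrative choices made here for clarity.

```python
import torch
import torch.nn as nn


class SaliencyGuidedBlock(nn.Module):
    """Illustrative sketch: attend over the top-k patches ranked by a
    precomputed saliency map, and add a residual self-attention branch
    over all patches so that global information is not lost."""

    def __init__(self, dim: int, num_heads: int = 8, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, patch_saliency: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch embeddings.
        # patch_saliency: (B, N) mean saliency of each patch, predicted by an
        # external (pretrained or fine-tuned) visual saliency model.
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))

        # Select the k most salient (most informative) patches per image.
        top_idx = patch_saliency.topk(k, dim=1).indices            # (B, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, d)       # (B, k, dim)
        salient = tokens.gather(1, gather_idx)                     # (B, k, dim)

        # Attention restricted to the salient (foreground) patches.
        local_out, _ = self.local_attn(salient, salient, salient)

        # Scatter the refined salient tokens back into the full sequence.
        out = tokens.clone()
        out.scatter_(1, gather_idx, local_out)

        # Residual branch: self-attention across all patches preserves the
        # global context that the filtering step may have removed.
        global_out, _ = self.global_attn(tokens, tokens, tokens)
        return self.norm(out + global_out)


if __name__ == "__main__":
    block = SaliencyGuidedBlock(dim=192)
    x = torch.randn(2, 196, 192)   # 14x14 patches of a 224x224 image
    sal = torch.rand(2, 196)       # per-patch saliency scores in [0, 1]
    print(block(x, sal).shape)     # torch.Size([2, 196, 192])
```

The design choice illustrated here mirrors the abstract: saliency filtering steers attention toward foreground regions, while the residual all-patch attention compensates for the global information lost by the filter.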