What Makes for Hierarchical Vision Transformer?

IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12714-12720. doi: 10.1109/TPAMI.2023.3282019.

Abstract

Recent studies indicate that hierarchical Vision Transformer (ViT) with a macro architecture of interleaved non-overlapped window-based self-attention & shifted-window operation can achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutional neural networks (CNNs) using densely slid kernels. In most recently proposed hierarchical ViTs, self-attention is the de-facto standard for spatial information aggregation. In this paper, we question whether self-attention is the only choice for hierarchical ViT to attain strong performance, and study the effects of different kinds of cross-window communication methods. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture termed TransLinear can achieve very strong performance in ImageNet-[Formula: see text] image recognition. Moreover, we find that TransLinear is able to leverage the ImageNet pre-trained weights and demonstrates competitive transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches. Our results reveal that the macro architecture, other than specific aggregation layers or cross-window communication mechanisms, is more responsible for hierarchical ViT's strong performance and is the real challenger to the ubiquitous CNN's dense sliding window paradigm.