PSLT: A Light-Weight Vision Transformer With Ladder Self-Attention and Progressive Shift

IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):11120-11135. doi: 10.1109/TPAMI.2023.3265499. Epub 2023 Aug 7.

Abstract

Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism enlarges the receptive field of the ladder self-attention block by modelling diverse local self-attention in each branch and enabling interaction among these branches. Second, the input feature of the ladder self-attention block is split equally along the channel dimension across the branches, which considerably reduces the computational cost of the block (with nearly [Formula: see text] the amount of parameters and FLOPs), and the outputs of the branches are then combined by a pixel-adaptive fusion. As a result, the ladder self-attention block models long-range interactions with a relatively small number of parameters and FLOPs. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
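
The abstract describes the block at a high level; the PyTorch sketch below illustrates the general idea of channel splitting across branches, progressively shifted local window attention, a ladder connection between branches, and pixel-adaptive fusion. It is a minimal sketch under assumptions not stated in the abstract (four branches, a 7x7 window, cyclic-shift windows without boundary masking, and the particular form of the ladder connection and fusion), not the authors' implementation, which is available at the project page above.

import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping local windows."""

    def __init__(self, dim, window_size=7, num_heads=1):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by window_size
        B, C, H, W = x.shape
        w = self.window_size
        # Partition the feature map into (H/w * W/w) windows of w*w tokens each.
        win = x.view(B, C, H // w, w, W // w, w)
        win = win.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        qkv = self.qkv(win).reshape(win.shape[0], w * w, 3, self.num_heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(win.shape[0], w * w, C)
        out = self.proj(out)
        # Undo the window partition back to (B, C, H, W).
        out = out.view(B, H // w, W // w, w, w, C)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


class LadderSelfAttention(nn.Module):
    """Illustrative ladder self-attention block (assumed hyperparameters):
    channels are split equally across branches, each branch runs local window
    attention on a progressively shifted grid, the previous branch's output
    feeds the next branch (the 'ladder'), and a 1x1 conv produces per-pixel
    weights that fuse the branch outputs."""

    def __init__(self, dim, num_branches=4, window_size=7):
        super().__init__()
        assert dim % num_branches == 0
        self.num_branches = num_branches
        self.window_size = window_size
        branch_dim = dim // num_branches
        self.attns = nn.ModuleList(
            WindowAttention(branch_dim, window_size) for _ in range(num_branches)
        )
        self.fuse = nn.Conv2d(dim, num_branches, kernel_size=1)  # pixel-adaptive weights
        self.proj = nn.Conv2d(branch_dim, dim, kernel_size=1)    # restore channel width

    def forward(self, x):  # x: (B, C, H, W)
        chunks = torch.chunk(x, self.num_branches, dim=1)
        outs, prev = [], 0
        for i, (chunk, attn) in enumerate(zip(chunks, self.attns)):
            shift = i * self.window_size // self.num_branches  # progressive shift
            inp = chunk + prev                 # ladder connection between branches
            inp = torch.roll(inp, shifts=(-shift, -shift), dims=(2, 3))
            out = attn(inp)                    # local attention on the shifted grid
            out = torch.roll(out, shifts=(shift, shift), dims=(2, 3))
            outs.append(out)
            prev = out
        # Pixel-adaptive fusion: softmax weights over branches at every position.
        weights = torch.softmax(self.fuse(torch.cat(outs, dim=1)), dim=1)
        fused = sum(weights[:, i:i + 1] * outs[i] for i in range(self.num_branches))
        return self.proj(fused)


# Usage example: a 64-channel feature map on a 56x56 grid (divisible by the window size).
block = LadderSelfAttention(dim=64, num_branches=4, window_size=7)
y = block(torch.randn(2, 64, 56, 56))  # -> (2, 64, 56, 56)

Because each branch attends only within its own reduced channel slice and local window, the quadratic attention cost applies to much smaller tensors than a single global self-attention over the full channel width, which is where the parameter and FLOP savings come from.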