Improved deep learning image classification algorithm based on Swin Transformer V2

PeerJ Comput Sci. 2023 Oct 30;9:e1665. doi: 10.7717/peerj-cs.1665. eCollection 2023.

Abstract

While convolutional operations effectively extract local features, their limited receptive fields make it difficult to capture global dependencies. Transformers, in contrast, excel at modeling long-range dependencies, but their self-attention mechanism lacks a means of exchanging information locally within specific regions. This article leverages the complementary strengths of Transformers and convolutional neural networks (CNNs) to enhance the Swin Transformer V2 model. By combining convolution with self-attention, the enhanced model unites the local feature-capturing capability of CNNs with the long-range dependency modeling of Transformers. The improved model strengthens local feature extraction by introducing a Swin Transformer Stem, an inverted residual feed-forward network, and a Dual-Branch Downsampling structure, and then models global dependencies with an improved self-attention mechanism. In addition, the attention mechanism's Q and K are downsampled to reduce computational and memory overhead. Under identical training conditions, the proposed method significantly improves classification accuracy on multiple image classification datasets and demonstrates stronger generalization.
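
To make the local-enhancement idea concrete, below is a minimal PyTorch sketch of an inverted-residual feed-forward block in the MobileNetV2 style, which is one common way such a module is realized: a pointwise expansion, a depthwise 3x3 convolution that injects local spatial mixing into the token sequence, and a pointwise projection with a residual shortcut. The class name, expansion ratio, and activation choice here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InvertedResidualFFN(nn.Module):
    """Feed-forward block in the inverted-residual style: pointwise
    expansion -> depthwise 3x3 conv (local spatial mixing) -> pointwise
    projection, with a residual shortcut. Illustrative sketch only."""

    def __init__(self, dim, expand_ratio=4):
        super().__init__()
        hidden = dim * expand_ratio
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)
        self.act = nn.GELU()
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence, reshaped onto its H x W grid
        B, N, C = x.shape
        y = x.transpose(1, 2).reshape(B, C, H, W)
        y = self.act(self.expand(y))
        y = self.act(self.dwconv(y))      # depthwise conv mixes neighboring tokens
        y = self.project(y)
        y = y.flatten(2).transpose(1, 2)  # back to (B, N, C)
        return x + y                      # residual shortcut
```

The depthwise convolution is what supplies the local information exchange the abstract attributes to the CNN side; the residual shortcut keeps the block a drop-in replacement for a standard Transformer MLP.

The attention-downsampling step can likewise be sketched. One caveat: the abstract states that Q and K are downsampled, but for the attention product to type-check, V must share K's reduced length, and reducing Q shrinks the output, which the sketch below restores with bilinear upsampling. The strided depthwise convolutions used for downsampling, the reduction ratio, and the upsampling step are all assumptions for illustration; the paper's exact implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    """Multi-head self-attention over an H x W token grid in which the
    query and key (and, for shape consistency, value) sequences are
    spatially downsampled before the attention product, shrinking the
    N x N attention map to (N/r^2) x (N/r^2). Illustrative sketch only."""

    def __init__(self, dim, num_heads=4, reduction=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.reduction = reduction
        # Strided depthwise convs downsample the token grid (an assumption;
        # pooling would serve equally well in this sketch).
        self.pool_q = nn.Conv2d(dim, dim, reduction, stride=reduction, groups=dim)
        self.pool_kv = nn.Conv2d(dim, dim, reduction, stride=reduction, groups=dim)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N == H * W; H and W divisible by `reduction`
        B, N, C = x.shape
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        h, w = H // self.reduction, W // self.reduction
        q_in = self.pool_q(grid).flatten(2).transpose(1, 2)    # (B, h*w, C)
        kv_in = self.pool_kv(grid).flatten(2).transpose(1, 2)  # (B, h*w, C)

        q = self.q(q_in).reshape(B, -1, self.num_heads,
                                 self.head_dim).transpose(1, 2)
        k, v = self.kv(kv_in).reshape(B, -1, 2, self.num_heads,
                                      self.head_dim).permute(2, 0, 3, 1, 4)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, hw, hw)
        out = attn.softmax(dim=-1) @ v                 # (B, heads, hw, head_dim)
        out = out.transpose(1, 2).reshape(B, h * w, C)

        # Restore the reduced output to the full token grid.
        out = out.transpose(1, 2).reshape(B, C, h, w)
        out = F.interpolate(out, size=(H, W), mode="bilinear",
                            align_corners=False)
        return self.proj(out.flatten(2).transpose(1, 2))  # (B, N, C)
```

With reduction ratio r, the attention map drops from N x N to (N/r^2) x (N/r^2), i.e., a factor of r^4 fewer entries, which is the source of the computational and memory savings the abstract claims. As a usage example, `DownsampledSelfAttention(dim=96, num_heads=4, reduction=2)` applied to a `(2, 56*56, 96)` tensor with `H = W = 56` returns a tensor of the same shape.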
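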

Keywords: Attention mechanism; Convolutional neural networks; Image classification; Transformer.

Grants and funding

This work was supported by the University-Industry Collaborative Education Program (Grant No. 22097077265201). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.