DctViT: Discrete Cosine Transform meet vision transformers

Neural Netw. 2024 Apr:172:106139. doi: 10.1016/j.neunet.2024.106139. Epub 2024 Jan 19.

Abstract

Vision transformers (ViTs) have become one of the dominant frameworks for vision tasks in recent years because self-attention lets them efficiently capture long-range dependencies in image recognition. In practice, both CNNs and ViTs have advantages and disadvantages, and several studies suggest that combining them is an effective way to balance performance and computational cost. In this paper, we propose a new hybrid CNN-transformer network that uses convolutions to extract local features and transformers to capture long-range dependencies. We also propose a new feature-map down-sampling method based on the Discrete Cosine Transform and self-attention, named DCT-Attention Down-sample (DAD). Our DctViT-L achieves 84.8% top-1 accuracy on ImageNet 1K, far outperforming CMT, Next-ViT, SpectFormer and other state-of-the-art models, with lower computational cost. Using DctViT-B as the backbone, RetinaNet achieves 46.8% mAP on COCO val2017, improving mAP by 2.5% and 1.1% over CMT-S and SpectFormer as backbones, respectively, at lower computational cost.
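The abstract does not spell out how the DAD module combines the DCT with self-attention, so the sketch below is only a minimal, plausible interpretation: keep the low-frequency block of a 2-D DCT to halve the spatial resolution, then refine the down-sampled tokens with standard self-attention. The class name `DctAttentionDownsample`, the helper `dct_matrix`, and the low-frequency truncation strategy are all assumptions for illustration, not the authors' exact design.

```python
# Minimal sketch of a DCT + self-attention down-sampling block (assumed design,
# not the paper's DAD implementation).
import math
import torch
import torch.nn as nn


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size (n, n)."""
    k = torch.arange(n).unsqueeze(1).float()   # frequency index
    i = torch.arange(n).unsqueeze(0).float()   # spatial index
    d = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
    d[0] = d[0] / math.sqrt(2.0)               # DC row rescaled for orthonormality
    return d


class DctAttentionDownsample(nn.Module):
    """Halve spatial resolution by keeping low-frequency DCT coefficients,
    then refine the down-sampled tokens with self-attention."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H and W even
        b, c, h, w = x.shape
        dh, dw = dct_matrix(h).to(x), dct_matrix(w).to(x)
        # 2-D DCT applied along rows and columns
        freq = dh @ x @ dw.transpose(0, 1)
        # keep the low-frequency top-left block -> half resolution
        low = freq[:, :, : h // 2, : w // 2]
        # inverse DCT at the reduced size to return to the spatial domain
        idh, idw = dct_matrix(h // 2).to(x), dct_matrix(w // 2).to(x)
        y = idh.transpose(0, 1) @ low @ idw
        # self-attention over the down-sampled tokens
        tokens = y.flatten(2).transpose(1, 2)   # (B, H/2 * W/2, C)
        tokens = tokens + self.attn(self.norm(tokens), self.norm(tokens),
                                    self.norm(tokens), need_weights=False)[0]
        return tokens.transpose(1, 2).reshape(b, c, h // 2, w // 2)


if __name__ == "__main__":
    m = DctAttentionDownsample(dim=64)
    out = m(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 7, 7])
```

Truncating the DCT spectrum acts as an anti-aliased low-pass down-sampling, which is one motivation for pairing frequency-domain pooling with attention; the actual DctViT design may differ in how the two are fused.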

Keywords: Computer vision; Deep learning; Discrete cosine transform; Image classification; Vision transformer.

MeSH terms

  • Image Interpretation, Computer-Assisted*