Conformer: Local Features Coupling Global Representations for Recognition and Detection

Zhiliang Peng; Zonghao Guo; Wei Huang; Yaowei Wang; Lingxi Xie; Jianbin Jiao; Qi Tian; Qixiang Ye

doi:10.1109/TPAMI.2023.3243048

IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):9454-9468. doi: 10.1109/TPAMI.2023.3243048. Epub 2023 Jun 30.

PMID: 37022836

Abstract

With convolution operations, Convolutional Neural Networks (CNNs) are good at extracting local features but have difficulty capturing global representations. With cascaded self-attention modules, vision transformers can capture long-distance feature dependencies but unfortunately degrade local feature details. In this paper, we propose a hybrid network structure, termed Conformer, that takes advantage of both convolution operations and self-attention mechanisms for enhanced representation learning. Conformer is rooted in the interactive coupling of CNN local features and transformer global representations at different resolutions. Conformer adopts a dual structure so that local details and global dependencies are retained to the maximum extent. We also propose a Conformer-based detector (ConformerDet), which learns to predict and refine object proposals by performing region-level feature coupling in an augmented cross-attention fashion. Experiments on the ImageNet and MS COCO datasets validate Conformer's superiority for visual recognition and object detection, demonstrating its potential as a general backbone network.
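
The coupling of a CNN feature map with transformer tokens at different resolutions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not specify the coupling operators. The sketch assumes average pooling to turn the feature map into patch tokens, nearest-neighbour upsampling to turn tokens back into a map, additive fusion, and equal channel widths in both branches. The function names (map_to_tokens, tokens_to_map, couple) and the patch parameter are hypothetical.

```python
import numpy as np


def map_to_tokens(feature_map, patch):
    """Average-pool a (C, H, W) feature map into (N, C) patch tokens."""
    c, h, w = feature_map.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    # Split each spatial axis into (grid, patch) and average over the patch axes.
    pooled = feature_map.reshape(c, gh, patch, gw, patch).mean(axis=(2, 4))
    return pooled.reshape(c, gh * gw).T  # one row per patch


def tokens_to_map(tokens, grid_hw, patch):
    """Nearest-neighbour upsample (N, C) tokens back to a (C, H, W) map."""
    gh, gw = grid_hw
    n, c = tokens.shape
    assert n == gh * gw, "token count must match the grid size"
    grid = tokens.T.reshape(c, gh, gw)
    # Repeat every token value over its patch-sized spatial footprint.
    return grid.repeat(patch, axis=1).repeat(patch, axis=2)


def couple(feature_map, tokens, patch):
    """One bidirectional exchange (assumed additive) between the two branches."""
    _, h, w = feature_map.shape
    new_tokens = tokens + map_to_tokens(feature_map, patch)
    new_map = feature_map + tokens_to_map(tokens, (h // patch, w // patch), patch)
    return new_map, new_tokens


# Toy input: 2 channels on a 4x4 grid, coupled with 4 tokens (2x2 patches).
local = np.arange(32, dtype=float).reshape(2, 4, 4)
global_tokens = np.ones((4, 2))
fused_map, fused_tokens = couple(local, global_tokens, patch=2)
print(fused_map.shape, fused_tokens.shape)
```

Both branches keep their own resolution: the map stays at 4x4 and the token sequence stays at length 4, so each exchange only adds information from the other branch. A real hybrid block would typically also align channel widths and normalise features before fusing, which this sketch leaves out.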

MeSH terms

Algorithms*
Learning*
Neural Networks, Computer