FAFuse: A Four-Axis Fusion framework of CNN and Transformer for medical image segmentation

Comput Biol Med. 2023 Oct 13:166:107567. doi: 10.1016/j.compbiomed.2023.107567. Online ahead of print.

Abstract

Medical image segmentation is crucial for accurate diagnosis and treatment planning. In recent years, convolutional neural networks (CNNs) and Transformers have been frequently adopted as network architectures for medical image segmentation. The convolution operation is limited in modeling long-range dependencies because it can only extract local information through its limited receptive field. In comparison, Transformers demonstrate excellent capability in modeling long-range dependencies but are less effective in capturing local information. Hence, effectively modeling long-range dependencies while preserving local information is essential for accurate medical image segmentation. In this paper, we propose a four-axis fusion framework called FAFuse, which exploits the complementary advantages of CNNs and Transformers. As the core component of FAFuse, a Four-Axis Fusion (FAF) module is proposed to efficiently fuse global and local information. FAF combines Four-Axis attention (axial attention along the height, width, main-diagonal, and counter-diagonal axes), a multi-scale convolution, and a residual structure with a depthwise separable convolution and a Hadamard product. Furthermore, we introduce deep supervision to enhance gradient flow and improve overall performance. Our approach achieves state-of-the-art segmentation accuracy on three publicly available medical image segmentation datasets. The code is available at https://github.com/cczu-xiao/FAFuse.
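To make the Four-Axis attention idea concrete, the following is a minimal sketch (not the authors' released code) of self-attention applied independently along the height, width, main-diagonal, and counter-diagonal axes of a feature map and then summed. The class names, the cyclic row-shift trick used to expose diagonals as columns, and the final summation are illustrative assumptions; consult the repository above for the actual implementation.

```python
# Sketch of four-axis axial attention on a (B, C, H, W) feature map.
# Assumption: diagonals are attended over by cyclically shifting each row so
# that diagonals line up as columns, which may differ from the paper's design.
import torch
import torch.nn as nn


def shift_rows(x, reverse=False):
    """Cyclically shift row i by i columns so diagonals become columns."""
    b, c, h, w = x.shape
    rows = [torch.roll(x[:, :, i, :], shifts=(i if reverse else -i), dims=-1)
            for i in range(h)]
    return torch.stack(rows, dim=2)


class AxialSelfAttention(nn.Module):
    """Multi-head self-attention over one spatial axis of a (B, C, H, W) map."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, axis):
        b, c, h, w = x.shape
        if axis == "h":   # sequences run along the height axis
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        else:             # sequences run along the width axis
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out, _ = self.attn(seq, seq, seq)
        if axis == "h":
            return out.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)


class FourAxisAttention(nn.Module):
    """Height, width, main-diagonal, and counter-diagonal axial attention."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.height = AxialSelfAttention(dim, heads)
        self.width = AxialSelfAttention(dim, heads)
        self.diag = AxialSelfAttention(dim, heads)
        self.anti = AxialSelfAttention(dim, heads)

    def forward(self, x):
        out = self.height(x, "h") + self.width(x, "w")
        # Main diagonal: shift rows, attend along height, undo the shift.
        d = shift_rows(x)
        out = out + shift_rows(self.diag(d, "h"), reverse=True)
        # Counter diagonal: flip horizontally, reuse the same trick, flip back.
        a = shift_rows(torch.flip(x, dims=[-1]))
        out = out + torch.flip(shift_rows(self.anti(a, "h"), reverse=True),
                               dims=[-1])
        return out


if __name__ == "__main__":
    feat = torch.randn(1, 64, 16, 16)           # toy feature map
    print(FourAxisAttention(64)(feat).shape)     # torch.Size([1, 64, 16, 16])
```

In this sketch the four axial outputs are simply added; in the FAF module they are further combined with multi-scale convolution and a residual branch with a depthwise separable convolution and a Hadamard product, as described in the abstract.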

Keywords: CNN; Feature fusion; Four-Axis attention; Medical image segmentation; Transformer.