Utilizing adaptive deformable convolution and position embedding for colon polyp segmentation with a visual transformer

Sci Rep. 2024 Mar 27;14(1):7318. doi: 10.1038/s41598-024-57993-0.

Abstract

Polyp detection is a challenging task in the diagnosis of Colorectal Cancer (CRC), and it demands clinical expertise due to the diverse nature of polyps. Recent years have witnessed the development of automated polyp detection systems that assist experts in early diagnosis, considerably reducing time consumption and diagnostic errors. In automated CRC diagnosis, polyp segmentation is an important step, typically carried out with deep learning segmentation models. Recently, Vision Transformers (ViTs) have been gradually replacing these models owing to their ability to capture long-range dependencies among image patches. However, the existing ViTs for polyp segmentation do not fully harness the inherent self-attention capability and instead incorporate complex attention mechanisms. This paper presents the Polyp-Vision Transformer (Polyp-ViT), a model built on the conventional Transformer architecture and enhanced with adaptive mechanisms for feature extraction and positional embedding. Polyp-ViT is evaluated on the Kvasir-SEG and CVC-ClinicDB datasets, achieving segmentation accuracies of 0.9891 ± 0.01 and 0.9875 ± 0.71 respectively, outperforming state-of-the-art models. Polyp-ViT is a promising tool for polyp segmentation that can also be adapted to other medical image segmentation tasks owing to its ability to generalize well.
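
The following is a minimal PyTorch sketch of the idea described in the abstract: a deformable-convolution patch embedding whose sampling offsets are predicted from the input (adaptive feature extraction), followed by a plain Transformer encoder with a learnable positional embedding and a simple per-patch mask head. The module names, layer sizes, and the mask-prediction head are illustrative assumptions, not the authors' Polyp-ViT implementation; only torchvision.ops.DeformConv2d is a real library API.

    # Sketch only: assumed architecture in the spirit of Polyp-ViT, not the authors' code.
    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d


    class DeformablePatchEmbed(nn.Module):
        """Tokenize an image with a deformable conv so the sampling grid can
        adapt to irregular polyp boundaries (assumed design)."""

        def __init__(self, in_ch=3, embed_dim=256, patch=16, k=3):
            super().__init__()
            # Two offsets (x, y) per kernel element, predicted per output location.
            self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, stride=patch, padding=1)
            self.deform = DeformConv2d(in_ch, embed_dim, kernel_size=k, stride=patch, padding=1)

        def forward(self, x):
            tokens = self.deform(x, self.offset(x))   # (B, C, H/p, W/p)
            return tokens.flatten(2).transpose(1, 2)  # (B, N, C) token sequence


    class PolypViTSketch(nn.Module):
        def __init__(self, embed_dim=256, depth=4, heads=8, img=256, patch=16):
            super().__init__()
            self.patch = patch
            n_tokens = (img // patch) ** 2
            self.embed = DeformablePatchEmbed(embed_dim=embed_dim, patch=patch)
            # Learnable positional embedding; the paper's adaptive variant is not reproduced here.
            self.pos = nn.Parameter(torch.zeros(1, n_tokens, embed_dim))
            layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(embed_dim, patch * patch)  # per-patch mask logits

        def forward(self, x):
            b, _, h, w = x.shape
            p = self.patch
            z = self.encoder(self.embed(x) + self.pos)              # (B, N, C)
            mask = self.head(z).view(b, h // p, w // p, p, p)       # per-patch masks
            return mask.permute(0, 1, 3, 2, 4).reshape(b, 1, h, w)  # (B, 1, H, W) logits


    if __name__ == "__main__":
        logits = PolypViTSketch()(torch.randn(1, 3, 256, 256))
        print(logits.shape)  # torch.Size([1, 1, 256, 256])

The key design point illustrated is that the convolutional sampling locations are themselves predicted from the image, so feature extraction adapts to polyp shape before the token sequence reaches the self-attention layers.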

Keywords: Deformable convolution; Polyp segmentation; Vision transformer.

MeSH terms

  • Ambulatory Care Facilities
  • Colon
  • Diagnostic Errors
  • Electric Power Supplies
  • Humans
  • Image Processing, Computer-Assisted
  • Polyps*