MMNet: A Mixing Module Network for Polyp Segmentation

Sensors (Basel). 2023 Aug 18;23(16):7258. doi: 10.3390/s23167258.

Abstract

Traditional encoder-decoder networks like U-Net have been extensively used for polyp segmentation. However, such networks have demonstrated limitations in explicitly modeling long-range dependencies: because each convolutional kernel attends to only a local subset of pixels, local patterns are emphasized over the global context. Several recent transformer-based networks have been shown to overcome these limitations. Such networks encode long-range dependencies through self-attention and thus learn highly expressive representations. However, self-attention over the whole image is expensive to compute, since its cost grows quadratically with the number of pixels. Patch embedding is therefore commonly used to group small regions of the image into single input features. Nevertheless, even when the image is treated as a 1D sequence of visual tokens, these transformers still lack the inductive bias of convolutions, and their limited low-level features hinder generalization to local contexts. We introduce a hybrid network that combines a transformer with a convolutional mixing network to overcome both the computational and the long-range dependency issues. A pretrained transformer is used as a feature-extracting encoder, and a mixing module network (MMNet) is introduced to capture long-range dependencies at reduced computational cost. Specifically, in the mixing module we use depth-wise and 1 × 1 convolutions to model spatial and cross-channel correlations, respectively, thereby capturing long-range dependencies. The proposed approach is evaluated qualitatively and quantitatively on five challenging polyp datasets across six metrics. Our MMNet outperforms the previous best polyp segmentation methods.
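To illustrate the mixing idea described above, the following is a minimal PyTorch-style sketch of a block that pairs a depth-wise convolution (spatial mixing) with a 1 × 1 convolution (cross-channel mixing). The class name, kernel size, normalization, and residual connection are assumptions for illustration only, not the authors' exact MMNet implementation.

```python
# Illustrative sketch of a convolutional mixing block (assumed structure,
# not the authors' exact MMNet module).
import torch
import torch.nn as nn

class MixingBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        # Depth-wise convolution: each channel is convolved independently,
        # mixing information across spatial locations (spatial correlation).
        self.spatial_mix = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels)
        # 1x1 (point-wise) convolution: mixes information across channels
        # at each spatial location (cross-channel correlation).
        self.channel_mix = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the spatial and channel mixing steps.
        return x + self.act(self.norm(self.channel_mix(self.spatial_mix(x))))

if __name__ == "__main__":
    feats = torch.randn(1, 64, 88, 88)  # e.g., features from an encoder stage
    out = MixingBlock(64)(feats)
    print(out.shape)  # torch.Size([1, 64, 88, 88])
```

Because the depth-wise convolution has one filter per channel and the point-wise convolution operates on single pixels, this decomposition is far cheaper than self-attention over all pixel pairs while still spreading information both spatially and across channels.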

Keywords: computational complexity; depth-wise and 1 × 1 convolution; mixing module; polyp segmentation; transformer.

MeSH terms

  • Algorithms*
  • Benchmarking*
  • Electric Power Supplies
  • Learning