Cross-Attention Based Multi-Resolution Feature Fusion Model for Self-Supervised Cervical OCT Image Classification

IEEE/ACM Trans Comput Biol Bioinform. 2023 Jul-Aug;20(4):2541-2554. doi: 10.1109/TCBB.2023.3246979. Epub 2023 Aug 9.

Abstract

Cervical cancer seriously endangers the health of the female reproductive system and, in severe cases, threatens women's lives. Optical coherence tomography (OCT) is a non-invasive, real-time, high-resolution imaging technology for cervical tissue. However, because interpreting cervical OCT images is a knowledge-intensive, time-consuming task, acquiring a large number of high-quality labeled images quickly is difficult, which poses a major challenge for supervised learning. In this study, we introduce the vision Transformer (ViT) architecture, which has recently achieved impressive results in natural image analysis, into the classification of cervical OCT images. Our work aims to develop a computer-aided diagnosis (CADx) approach based on a self-supervised ViT model to classify cervical OCT images effectively. We leverage masked autoencoders (MAE) for self-supervised pre-training on cervical OCT images, giving the proposed classification model better transfer learning ability. During fine-tuning, the ViT-based classification model extracts multi-scale features from OCT images of different resolutions and fuses them with a cross-attention module. Ten-fold cross-validation on an OCT image dataset from a multi-center clinical study of 733 patients in China shows that our model achieved an AUC of 0.9963 ± 0.0069, with 95.89 ± 3.30% sensitivity and 98.23 ± 1.36% specificity, outperforming several state-of-the-art Transformer- and convolutional neural network (CNN)-based classification models in the binary task of detecting high-risk cervical diseases, including high-grade squamous intraepithelial lesion (HSIL) and cervical cancer. Furthermore, with a cross-shaped voting strategy, our model achieved 92.06% sensitivity and 95.56% specificity on an external validation dataset of 288 three-dimensional (3D) OCT volumes from 118 Chinese patients at a different hospital, matching or exceeding the average performance of four medical experts, each with more than one year of experience using OCT. Besides its promising classification performance, our model can detect and visualize local lesions using the attention map of the standard ViT model, offering good interpretability that helps gynecologists locate and diagnose possible cervical diseases.
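To make the MAE pre-training step concrete, the following is a minimal sketch of the random patch masking at the core of masked autoencoders: the encoder sees only a small visible subset of patch tokens, and a lightweight decoder reconstructs the masked pixels. All names and shapes here are illustrative assumptions; the mask_ratio of 0.75 is the default from the original MAE paper and is not stated in this abstract.

```python
import torch

def random_mask(patches, mask_ratio=0.75):
    """Sketch of MAE-style random masking (hypothetical helper).

    patches: (B, N, D) patch token embeddings.
    Returns the visible tokens the encoder would process and a
    binary mask (0 = visible, 1 = masked) for the reconstruction loss.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                 # one random score per token
    ids_shuffle = noise.argsort(dim=1)       # ascending: lowest scores kept
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)            # mark kept positions as visible
    return visible, mask
```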
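The abstract's multi-resolution fusion can likewise be sketched. One common design (used, e.g., in CrossViT) lets the CLS token of one resolution branch attend over the patch tokens of the other branch; whether this paper follows exactly that design is not specified, so the module below is a sketch under that assumption, with illustrative dimensions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse token sequences from two image resolutions.

    The CLS token of branch A (query) attends to all tokens of branch B
    (keys/values); the fused CLS token is added back residually while
    branch A's patch tokens pass through unchanged.
    """
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, tokens_a, tokens_b):
        # tokens_a: (B, 1 + N_a, dim), CLS token first; tokens_b: (B, 1 + N_b, dim)
        cls_a = self.norm_q(tokens_a[:, :1])     # query: branch-A CLS token
        kv = self.norm_kv(tokens_b)              # keys/values: branch-B tokens
        fused, _ = self.attn(cls_a, kv, kv)      # cross-attention
        return torch.cat([tokens_a[:, :1] + fused, tokens_a[:, 1:]], dim=1)

# Usage with assumed ViT-Base dimensions: a 224-pixel input with 16-pixel
# patches gives 196 tokens; a 112-pixel input gives 49.
fusion = CrossAttentionFusion(dim=768, num_heads=8)
hi_res = torch.randn(2, 1 + 196, 768)
lo_res = torch.randn(2, 1 + 49, 768)
out = fusion(hi_res, lo_res)                     # (2, 197, 768)
```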
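Finally, the interpretability claim rests on reading out a ViT attention map. A minimal sketch of that readout is shown below, assuming attention weights of shape (B, num_heads, 1 + N, 1 + N) with the CLS token first; averaging the CLS-to-patch attention over heads and reshaping it to the patch grid yields a coarse lesion heatmap. The function name and grid size are assumptions for illustration.

```python
import torch

@torch.no_grad()
def cls_attention_heatmap(attn_weights, grid_size=14):
    """Sketch: convert one ViT layer's attention into a patch-level heatmap.

    attn_weights: (B, num_heads, 1 + N, 1 + N), N = grid_size**2
    (e.g., 14 x 14 patches for a 224-pixel input with 16-pixel patches).
    """
    cls_to_patch = attn_weights[:, :, 0, 1:]        # CLS attends to each patch
    heat = cls_to_patch.mean(dim=1)                 # average over heads
    heat = heat / heat.amax(dim=-1, keepdim=True)   # normalize to [0, 1]
    return heat.reshape(-1, grid_size, grid_size)   # (B, H, W) heatmap
```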

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Diagnosis, Computer-Assisted
  • Female
  • Humans
  • Image Processing, Computer-Assisted / methods
  • Neural Networks, Computer
  • Tomography, Optical Coherence / methods
  • Uterine Cervical Neoplasms* / diagnostic imaging