Medical lesion segmentation by combining multimodal images with modality weighted UNet

Med Phys. 2022 Jun;49(6):3692-3704. doi: 10.1002/mp.15610. Epub 2022 Apr 7.

Abstract

Purpose: Automatic segmentation of medical lesions is a prerequisite for efficient clinical analysis. Segmentation algorithms for multimodal medical images have received much attention in recent years. Different strategies for multimodal combination (or fusion), such as probability theory, fuzzy models, belief functions, and deep neural networks, have also been developed. In this paper, we propose the modality weighted UNet (MW-UNet) and an attention-based fusion method to combine multimodal images for medical lesion segmentation.

Methods: MW-UNet is a multimodal fusion method based on UNet. It uses a shallower architecture and fewer feature map channels to reduce the number of network parameters, and it introduces a new multimodal fusion method called fusion attention. A weighted sum rule and fusion attention combine the feature maps in intermediate layers. During training, all the weight parameters are updated through backpropagation like the other parameters in the network. We also incorporate residual blocks into MW-UNet to further improve segmentation performance. The comparison between the automatic multimodal lesion segmentations and the manual contours was quantified by (1) five metrics: Dice, 95% Hausdorff distance (HD95), volumetric overlap error (VOE), relative volume difference (RVD), and mean intersection over union (mIoU); and (2) the number of parameters and floating-point operations (FLOPs), used to measure the complexity of the network.
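
The weighted sum rule described in Methods can be illustrated with a minimal sketch. It assumes one scalar weight per modality and a softmax normalization so that the weights sum to one; the abstract states only that the weights are learned by backpropagation, so the normalization, the function names, and the nested-list feature maps are all illustrative assumptions. The fusion attention step and the network itself are not shown.

```python
import math


def softmax(raw_weights):
    """Normalize raw modality weights so they are positive and sum to 1."""
    peak = max(raw_weights)  # subtract the maximum for numerical stability
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum_fusion(feature_maps, raw_weights):
    """Fuse same-shaped H x W feature maps, one per modality, by a weighted sum.

    feature_maps: list of nested lists, one H x W map per modality.
    raw_weights:  one scalar per modality (hypothetical learnable parameters;
                  in a real network they would be updated by backpropagation).
    """
    if len(feature_maps) != len(raw_weights):
        raise ValueError("need exactly one weight per modality")
    weights = softmax(raw_weights)  # assumption: softmax normalization
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, weights):
        for i in range(height):
            for j in range(width):
                fused[i][j] += weight * fmap[i][j]
    return fused


# Two modalities with equal raw weights: the fused map is the plain average.
phase_a = [[1.0, 2.0]]
phase_b = [[3.0, 4.0]]
fused_map = weighted_sum_fusion([phase_a, phase_b], [0.0, 0.0])
```

With equal raw weights the fusion reduces to averaging the modalities; as training shifts the weights, the more informative modality contributes more to the fused feature map.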

Results: The proposed method is verified on ZJCHD, a data set of contrast-enhanced computed tomography (CECT) for liver lesion segmentation collected at the Cancer Hospital of the University of Chinese Academy of Sciences (Zhejiang Cancer Hospital), Hangzhou, China. For accuracy evaluation, we use 120 patients with liver lesions from ZJCHD, of which 100 are used for fourfold cross-validation (CV) and 20 are used for the hold-out (HO) test. The mean Dice was 90.55 ± 14.44% and 89.31 ± 19.07% for the HO and CV tests, respectively. The corresponding HD95, VOE, RVD, and mIoU of the two tests are 1.95 ± 1.83 and 2.67 ± 3.35 mm, 13.11 ± 15.83 and 13.13 ± 18.52%, 12.20 ± 18.20 and 13.00 ± 21.82%, and 83.79 ± 15.83 and 82.35 ± 20.03%. The parameters and FLOPs of our method are 4.04 M and 18.36 G, respectively.
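
For readers unfamiliar with the overlap metrics reported above, the following sketch computes Dice, VOE, and RVD for a pair of binary masks using their common definitions. The abstract does not state the exact conventions used (for example, whether RVD is signed or absolute, or how mIoU is averaged), so the signed RVD here and the function name `overlap_metrics` are assumptions. HD95 is omitted because it requires surface distances rather than voxel counts.

```python
def overlap_metrics(pred, ref):
    """Compute Dice, VOE, and RVD for two flattened binary masks.

    pred: predicted mask as a flat sequence of 0/1 values.
    ref:  reference (manual) mask of the same length.
    """
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    pred_volume = sum(1 for p in pred if p)
    ref_volume = sum(1 for r in ref if r)
    union = pred_volume + ref_volume - intersection
    if ref_volume == 0 or union == 0:
        raise ValueError("reference mask must contain at least one voxel")
    return {
        # Dice = 2|A ∩ B| / (|A| + |B|)
        "dice": 2 * intersection / (pred_volume + ref_volume),
        # VOE = 1 - |A ∩ B| / |A ∪ B|, i.e. one minus the foreground IoU
        "voe": 1 - intersection / union,
        # RVD = (|A| - |B|) / |B|, signed here (assumption)
        "rvd": (pred_volume - ref_volume) / ref_volume,
    }


# Toy example: the prediction covers the reference voxel plus one extra voxel.
scores = overlap_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case the intersection is 1 voxel, the union is 2, and the predicted volume is twice the reference, giving a Dice of 2/3, a VOE of 0.5, and an RVD of 1.0.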

Conclusions: The results show that our method performs well on multimodal liver lesion segmentation. It can be easily extended to other multimodal data sets and other networks for multimodal fusion. Our method has the potential to provide doctors with multimodal annotations and assist them with clinical diagnosis.

Keywords: attention; deep neural networks; medical image segmentation; multimodality fusion.

MeSH terms

  • Abdomen
  • Algorithms
  • Humans
  • Image Processing, Computer-Assisted / methods
  • Liver
  • Neural Networks, Computer*
  • Tomography, X-Ray Computed* / methods