The effects of different levels of realism on the training of CNNs with only synthetic images for the semantic segmentation of robotic instruments in a head phantom

Int J Comput Assist Radiol Surg. 2020 Aug;15(8):1257-1265. doi: 10.1007/s11548-020-02185-0. Epub 2020 May 22.

Abstract

Purpose: The manual generation of training data for the semantic segmentation of medical images using deep neural networks is a time-consuming and error-prone task. In this paper, we investigate the effect of different levels of realism on the training of deep neural networks for semantic segmentation of robotic instruments. An interactive virtual-reality environment was developed to generate synthetic images for robot-aided endoscopic surgery. In contrast with earlier works, we use physically based rendering for increased realism.

Methods: Using a virtual reality simulator that replicates our robotic setup, three synthetic image databases with an increasing level of realism were generated: flat, basic, and realistic (using the physically-based rendering). Each of those databases was used to train 20 instances of a UNet-based semantic-segmentation deep-learning model. The networks trained with only synthetic images were evaluated on the segmentation of 160 endoscopic images of a phantom. The networks were compared using the Dwass-Steel-Critchlow-Fligner nonparametric test.

Results: Our results show that the levels of realism increased the mean intersection-over-union (mIoU) of the networks on endoscopic images of a phantom ([Formula: see text]). The median mIoU values were 0.235 for the flat dataset, 0.458 for the basic, and 0.729 for the realistic. All the networks trained with synthetic images outperformed naive classifiers. Moreover, in an ablation study, we show that the mIoU of physically based rendering is superior to texture mapping ([Formula: see text]) of the instrument (0.606), the background (0.685), and the background and instruments combined (0.672).

Conclusions: Using physical-based rendering to generate synthetic images is an effective approach to improve the training of neural networks for the semantic segmentation of surgical instruments in endoscopic images. Our results show that this strategy can be an essential step in the broad applicability of deep neural networks in semantic segmentation tasks and help bridge the domain gap in machine learning.

Keywords: Deep learning; Photorealistic rendering; Semantic segmentation.

MeSH terms

  • Databases, Factual
  • Endoscopy
  • Humans
  • Image Processing, Computer-Assisted / methods
  • Machine Learning*
  • Neural Networks, Computer*
  • Phantoms, Imaging
  • Robotic Surgical Procedures / education*
  • Simulation Training*

Grants and funding