From quantitative metrics to clinical success: assessing the utility of deep learning for tumor segmentation in breast surgery

Chris Yeung; Tamas Ungi; Zoe Hu; Amoon Jamzad; Martin Kaufmann; Ross Walker; Shaila Merchant; Cecil Jay Engel; Doris Jabs; John Rudan; Parvin Mousavi; Gabor Fichtinger

doi:10.1007/s11548-024-03133-y

Int J Comput Assist Radiol Surg. 2024 Apr 20. doi: 10.1007/s11548-024-03133-y. Online ahead of print.

Authors

Chris Yeung¹, Tamas Ungi², Zoe Hu³, Amoon Jamzad², Martin Kaufmann⁴, Ross Walker⁴, Shaila Merchant⁴, Cecil Jay Engel⁴, Doris Jabs⁵, John Rudan⁴, Parvin Mousavi², Gabor Fichtinger²

Affiliations

¹ School of Computing, Queen's University, Kingston, ON, Canada. chris.yeung@queensu.ca.
² School of Computing, Queen's University, Kingston, ON, Canada.
³ School of Medicine, Queen's University, Kingston, ON, Canada.
⁴ Department of Surgery, Queen's University, Kingston, ON, Canada.
⁵ Department of Radiology, Queen's University, Kingston, ON, Canada.

PMID: 38642296

Abstract

Purpose: Preventing positive margins is essential for ensuring favorable patient outcomes following breast-conserving surgery (BCS). Deep learning has the potential to enable this by automatically contouring the tumor and guiding resection in real time. However, evaluation of such models with respect to pathology outcomes is necessary for their successful translation into clinical practice.

Methods: Sixteen deep learning models based on established architectures in the literature are trained on 7318 ultrasound images from 33 patients. Models are ranked by an expert based on their contours generated from images in our test set. Generated contours from each model are also analyzed using recorded cautery trajectories of five navigated BCS cases to predict margin status. Predicted margins are compared with pathology reports.
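As a rough illustration of how a tumor contour and a recorded cautery trajectory could be combined into a margin prediction, the sketch below flags a margin as positive when the cautery tip passes within a chosen distance of the contour. This is not the authors' method, which the abstract does not describe. The function names, the point-cloud representation of contour and trajectory, and the `threshold_mm` value are all assumptions made for illustration.

```python
import numpy as np

def min_distance_mm(cautery_points, tumor_points):
    """Smallest Euclidean distance between any cautery tip position
    and any tumor contour point (both given as N x 3 arrays, in mm)."""
    c = np.asarray(cautery_points, dtype=float)
    t = np.asarray(tumor_points, dtype=float)
    # Pairwise differences via broadcasting: shape (n_cautery, n_tumor, 3)
    diffs = c[:, None, :] - t[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=2)).min())

def predict_positive_margin(cautery_points, tumor_points, threshold_mm=1.0):
    """Hypothetical rule: predict a positive margin when the cautery
    came closer to the tumor contour than threshold_mm.
    The 1.0 mm default is an illustrative assumption, not a value from the paper."""
    return bool(min_distance_mm(cautery_points, tumor_points) < threshold_mm)

# Toy data: two contour points and two recorded cautery tip positions.
tumor = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cautery = [(5.0, 0.0, 0.0), (1.0, 0.0, 3.0)]

closest = min_distance_mm(cautery, tumor)
predicted = predict_positive_margin(cautery, tumor)
```

The brute-force pairwise distance is adequate for small point sets. A spatial index would be a reasonable substitute for dense contours and long trajectories.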

Results: The model ranked best by both quantitative evaluation and our visual ranking framework achieved a mean Dice score of 0.959. Quantitative metrics are positively associated with expert visual rankings. However, the predictive value of the generated contours was limited, with a sensitivity of 0.750 and a specificity of 0.433 when tested against pathology reports.
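For readers less familiar with the metrics reported here, the following minimal sketch computes a Dice score between two binary masks, and sensitivity and specificity from paired predicted and reference labels. The function names and toy inputs are illustrative assumptions. They do not reproduce the study's data or evaluation code.

```python
import numpy as np

def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a common convention).
        return 1.0
    return float(2.0 * np.logical_and(pred, true).sum() / denom)

def sensitivity_specificity(predicted_positive, actual_positive):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Returns NaN for a metric whose denominator is zero."""
    pred = np.asarray(predicted_positive, dtype=bool)
    actual = np.asarray(actual_positive, dtype=bool)
    tp = int(np.sum(pred & actual))
    fn = int(np.sum(~pred & actual))
    tn = int(np.sum(~pred & ~actual))
    fp = int(np.sum(pred & ~actual))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# Toy masks: predicted contour covers 2 pixels, reference covers 1, overlap is 1.
dice = dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Toy margin labels (True = positive margin), predicted vs. pathology.
sens, spec = sensitivity_specificity(
    [True, True, False, False, True],
    [True, False, False, True, True],
)
```

In a setting like the one described, sensitivity would be the fraction of pathology-positive margins that were predicted positive, and specificity the fraction of pathology-negative margins that were predicted negative.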

Conclusion: We present a clinical evaluation of deep learning models trained for intraoperative tumor segmentation in breast-conserving surgery. We demonstrate that automatic contouring is limited in predicting pathology margins despite achieving high performance on quantitative metrics.

Keywords: Breast ultrasound; Clinical evaluation; Deep learning; Surgical navigation.