Stratification by Tumor Grade Groups in a Holistic Evaluation of Machine Learning for Brain Tumor Segmentation

Snehal Prabhudesai; Nicholas Chandler Wang; Vinayak Ahluwalia; Xun Huan; Jayapalli Rajiv Bapuraj; Nikola Banovic; Arvind Rao

doi:10.3389/fnins.2021.740353

Front Neurosci. 2021 Oct 6;15:740353. doi: 10.3389/fnins.2021.740353. eCollection 2021.

Authors

Snehal Prabhudesai¹, Nicholas Chandler Wang², Vinayak Ahluwalia³, Xun Huan⁴, Jayapalli Rajiv Bapuraj⁵, Nikola Banovic¹, Arvind Rao² ⁶ ⁷ ⁸

Affiliations

¹ Computer Science and Engineering, University of Michigan, Ann Arbor, MI, United States.
² Computational Medicine and Bioinformatics, Michigan Medicine, Ann Arbor, MI, United States.
³ Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, United States.
⁴ Mechanical Engineering, University of Michigan, Ann Arbor, MI, United States.
⁵ Department of Radiology, University of Michigan, Ann Arbor, MI, United States.
⁶ Department of Biostatistics, University of Michigan, Ann Arbor, MI, United States.
⁷ Department of Radiation Oncology, University of Michigan, Ann Arbor, MI, United States.
⁸ Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI, United States.

Abstract

Accurate and consistent segmentation plays an important role in the diagnosis, treatment planning, and monitoring of both High Grade Glioma (HGG), including Glioblastoma Multiforme (GBM), and Low Grade Glioma (LGG). Accuracy of segmentation can be affected by the imaging presentation of glioma, which varies greatly between the two tumor grade groups. In recent years, researchers have used Machine Learning (ML) to segment tumors more rapidly and consistently than manual segmentation allows. However, existing ML validation relies heavily on computing summary statistics and rarely tests the generalizability of an algorithm on clinically heterogeneous data. In this work, our goal is to investigate how to holistically evaluate the performance of ML algorithms on a brain tumor segmentation task. We address the need for rigorous evaluation of ML algorithms and present four axes of model evaluation: diagnostic performance, model confidence, robustness, and data quality. We perform a comprehensive evaluation of a glioma segmentation ML algorithm by stratifying data by specific tumor grade groups (GBM and LGG) and evaluating the algorithm on each of the four axes. The main takeaways of our work are: (1) ML algorithms need to be evaluated on out-of-distribution data to assess generalizability, reflective of tumor heterogeneity. (2) Segmentation metrics alone are of limited use for evaluating the errors made by ML algorithms and describing their consequences. (3) Adoption of tools from other domains, such as robustness (adversarial attacks) and model uncertainty (prediction intervals), leads to a more comprehensive performance evaluation. Such a holistic evaluation framework could shed light on an algorithm's clinical utility and help it evolve into a more clinically valuable tool.
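
As a rough illustration of why stratifying by tumor grade group can matter, the sketch below computes a segmentation overlap score per group and compares it with the pooled average. It is not the authors' pipeline. The Dice coefficient is assumed as the metric (the abstract does not name one), the helper names `dice` and `dice_by_group` are hypothetical, and the masks are made-up toy data.

```python
from collections import defaultdict
from statistics import mean


def dice(pred, truth):
    """Dice overlap between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def dice_by_group(cases):
    """Return the mean Dice per group for (group, pred, truth) triples."""
    scores = defaultdict(list)
    for group, pred, truth in cases:
        scores[group].append(dice(pred, truth))
    return {group: mean(vals) for group, vals in scores.items()}


# Toy, made-up masks used purely for illustration (not real patient data).
cases = [
    ("GBM", [1, 1, 0, 0], [1, 1, 0, 0]),
    ("GBM", [1, 1, 0, 0], [1, 0, 0, 0]),
    ("LGG", [1, 0, 0, 0], [0, 1, 0, 0]),
    ("LGG", [1, 1, 1, 0], [1, 1, 0, 0]),
]

per_group = dice_by_group(cases)                  # one summary per grade group
pooled = mean(dice(p, t) for _, p, t in cases)    # single summary statistic
print(per_group, pooled)
```

In this toy setting the pooled mean sits between the two group means, so a single summary statistic would hide the weaker performance on one group; that is the kind of gap a stratified evaluation is meant to expose.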

Keywords: GBM; LGG; brain imaging; evaluation; medical AI; segmentation.

Grants and funding

R37 CA214955/CA/NCI NIH HHS/United States