Quantifying uncertainty in machine learning classifiers for medical imaging

Int J Comput Assist Radiol Surg. 2022 Apr;17(4):711-718. doi: 10.1007/s11548-022-02578-3. Epub 2022 Mar 12.

Abstract

Purpose: Machine learning (ML) models in medical imaging (MI) can be of great value in computer-aided diagnostic systems, but little attention is given to the confidence (alternatively, uncertainty) of such models, which may have significant clinical implications. This paper applied, validated, and explored a technique for assessing uncertainty in convolutional neural networks (CNNs) in the context of MI.

Materials and methods: We used two publicly accessible imaging datasets: a chest x-ray dataset (pneumonia vs. control) and a skin cancer imaging dataset (malignant vs. benign) to explore the proposed measure of uncertainty through experiments with varying combinations of class imbalance and sample size, and experiments with images close to the classification boundary. We further verified our hypothesis by examining the relationship with other performance metrics and by cross-checking CNN predictions and confidence scores with an expert radiologist (available in the Supplementary Information). Additionally, bounds were derived on the uncertainty metric, and recommendations for interpretability were made.

Results: With respect to training set class imbalance for the pneumonia MI dataset, the uncertainty metric was minimized when both classes were nearly equal in size (regardless of training set size) and was approximately 17% smaller than the maximum uncertainty resulting from greater imbalance. We found that less-obvious test images (those closer to the classification boundary) produced higher classification uncertainty, about 10-15 times greater than images further from the boundary. Relevant MI performance metrics such as accuracy, sensitivity, and specificity showed seemingly negative linear correlations with uncertainty, though none were statistically significant (p > 0.05). The expert radiologist and CNN expressed agreement on a small sample of test images, though this finding is only preliminary.

Conclusions: This paper demonstrated the importance of reporting uncertainty alongside predictions in medical imaging. The results show considerable potential for automatically assessing classifier reliability on each prediction with the proposed uncertainty metric.
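The abstract does not define the paper's specific uncertainty metric, but the general idea of scoring per-prediction confidence can be illustrated with a common, generic stand-in: the Shannon entropy of a classifier's predicted class distribution. This is a hedged sketch only, not the authors' metric; the function name and example probabilities are illustrative assumptions.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution.

    A generic uncertainty score, NOT the paper's specific metric:
    near-uniform predictions (ambiguous inputs) approach log(n_classes),
    while confident, one-hot-like predictions approach 0.
    """
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12  # guard against log(0)
    return float(-np.sum(probs * np.log(probs + eps)))

# A confident prediction scores far lower than an ambiguous one,
# consistent with near-boundary images yielding higher uncertainty.
confident = predictive_entropy([0.99, 0.01])  # well away from the boundary
ambiguous = predictive_entropy([0.55, 0.45])  # close to the boundary
```

Any monotone per-prediction uncertainty score would behave similarly in the class-boundary experiments described above: images near the decision boundary produce flatter predicted distributions and therefore higher scores.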

Keywords: Confidence; Medical imaging; Neural network; Uncertainty.

MeSH terms

  • Diagnostic Imaging
  • Humans
  • Machine Learning*
  • Neural Networks, Computer*
  • Reproducibility of Results
  • Uncertainty