Performance metric curve analysis framework to assess impact of the decision variable threshold, disease prevalence, and dataset variability in two-class classification

Heather M Whitney; Karen Drukker; Maryellen L Giger

doi:10.1117/1.JMI.9.3.035502

J Med Imaging (Bellingham). 2022 May;9(3):035502. doi: 10.1117/1.JMI.9.3.035502. Epub 2022 May 31.

Authors

Heather M Whitney¹,², Karen Drukker¹, Maryellen L Giger¹

Affiliations

¹ University of Chicago, Department of Radiology, Chicago, Illinois, United States.
² Wheaton College, Department of Physics, Wheaton, Illinois, United States.

Abstract

Purpose: The aim of this study is to (1) demonstrate a graphical method and interpretation framework to extend performance evaluation beyond receiver operating characteristic curve analysis and (2) assess the impact of disease prevalence and variability in training and testing sets, particularly when a specific operating point is used.

Approach: The proposed performance metric curves (PMCs) simultaneously assess sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and the 95% confidence intervals thereof, as a function of the threshold for the decision variable. We investigated the utility of PMCs using six example operating points associated with commonly used methods to select operating points (including the Youden index and maximum mutual information). As an example, we applied PMCs to the task of distinguishing between malignant and benign breast lesions using human-engineered radiomic features extracted from dynamic contrast-enhanced magnetic resonance images. The dataset had 1885 lesions, with the images acquired in 2015 and 2016 serving as the training set (1450 lesions) and those acquired in 2017 as the test set (435 lesions). Our study used this dataset in two ways: (1) the clinical dataset itself and (2) simulated datasets with features based on the clinical set but with five different disease prevalences. The median and 95% CI of the number of type I (false positive) and type II (false negative) errors were determined for each operating point of interest.

Results: PMCs from both the clinical and simulated datasets demonstrated that PMCs could support interpretation of the impact of decision threshold choice on type I and type II errors of classification, particularly relevant to prevalence.

Conclusion: PMCs allow simultaneous evaluation of the four performance metrics of sensitivity, specificity, PPV, and NPV as a function of the decision threshold. This may create a better understanding of two-class classifier performance in machine learning.
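
Illustrative sketch (not the authors' implementation): the four metrics named in the abstract, together with the type I and type II error counts, can be tabulated against the decision threshold roughly as follows. The function names, the toy scores, and the convention that a case is called positive when its score is at or above the threshold are all assumptions made for this example. Confidence intervals are omitted. The prevalence helper applies Bayes' rule to a fixed sensitivity and specificity, which is a standard textbook relation and differs from the simulated-dataset approach the abstract describes.

```python
import numpy as np


def performance_metric_curves(scores, labels, thresholds):
    """Sensitivity, specificity, PPV, NPV and error counts at each threshold.

    Assumed convention: a case is called positive when score >= threshold.
    Metrics with a zero denominator are reported as NaN.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    names = ("sensitivity", "specificity", "ppv", "npv",
             "false_positives", "false_negatives")
    curves = {name: [] for name in names}

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    for t in thresholds:
        called_positive = scores >= t
        tp = int(np.sum(called_positive & labels))
        fp = int(np.sum(called_positive & ~labels))   # type I errors
        fn = int(np.sum(~called_positive & labels))   # type II errors
        tn = int(np.sum(~called_positive & ~labels))
        curves["sensitivity"].append(ratio(tp, tp + fn))
        curves["specificity"].append(ratio(tn, tn + fp))
        curves["ppv"].append(ratio(tp, tp + fp))
        curves["npv"].append(ratio(tn, tn + fn))
        curves["false_positives"].append(fp)
        curves["false_negatives"].append(fn)
    return {name: np.array(values) for name, values in curves.items()}


def youden_operating_point(curves, thresholds):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    j = curves["sensitivity"] + curves["specificity"] - 1.0
    best = int(np.nanargmax(j))
    return thresholds[best], j[best]


def predictive_values_at_prevalence(sensitivity, specificity, prevalence):
    """PPV and NPV implied by Bayes' rule at a given disease prevalence."""
    ppv_den = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    npv_den = specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    ppv = sensitivity * prevalence / ppv_den
    npv = specificity * (1 - prevalence) / npv_den
    return ppv, npv


# Toy data (hypothetical): three benign (0) and four malignant (1) cases.
thresholds = np.array([0.3, 0.5, 0.65])
scores = [0.1, 0.2, 0.6, 0.4, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1, 1]
curves = performance_metric_curves(scores, labels, thresholds)
best_threshold, best_j = youden_operating_point(curves, thresholds)
```

Plotting the four arrays in `curves` against `thresholds` on shared axes gives a picture in the spirit of a PMC, and calling `predictive_values_at_prevalence` over a range of prevalences shows how PPV and NPV shift while sensitivity and specificity stay fixed.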

Keywords: AUC; artificial intelligence; machine learning; performance assessment; radiomics; repeatability.
