Corrected ROC analysis for misclassified binary outcomes

Stat Med. 2017 Jun 15;36(13):2148-2160. doi: 10.1002/sim.7260. Epub 2017 Feb 28.

Abstract

Creating accurate risk prediction models from Big Data resources such as Electronic Health Records (EHRs) is a critical step toward achieving precision medicine. A major challenge in developing these tools is accounting for imperfect aspects of EHR data, particularly the potential for misclassified outcomes. Misclassification, the swapping of case and control outcome labels, is well known to bias effect size estimates for regression prediction models. In this paper, we study the effect of misclassification on accuracy assessment for risk prediction models and find that it leads to bias in the area under the curve (AUC) metric from standard ROC analysis. The extent of the bias is determined by the false positive and false negative misclassification rates as well as disease prevalence. Notably, we show that simply correcting for misclassification while building the prediction model is not sufficient to remove the bias in AUC. We therefore introduce an intuitive misclassification-adjusted ROC procedure that accounts for uncertainty in observed outcomes and produces bias-corrected estimates of the true AUC. The method requires that misclassification rates be either known or estimable, quantities typically required for the modeling step. The computational simplicity of our method is a key advantage, making it ideal for efficiently comparing multiple prediction models on very large datasets. Finally, we apply the correction method to a hospitalization prediction model from a cohort of over 1 million patients from the Veterans Health Administration's EHR. Implementations of the ROC correction are provided for Stata and R. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
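
To illustrate why the AUC bias depends on the two misclassification rates and on prevalence, the sketch below works through a simplified case. It is not the paper's procedure. It assumes that label errors are independent of the risk score given the true outcome (nondifferential misclassification), that `prevalence` is the true disease prevalence, and that both error rates are known. Under those assumptions, the scores of observed cases and observed controls are mixtures of the true case and true control score distributions, which gives observed AUC = 0.5 + (true AUC - 0.5) * (PPV + NPV - 1), where PPV and NPV describe how well the observed label predicts the true one. All function names are hypothetical.

```python
def predictive_values(prevalence, fp_rate, fn_rate):
    """PPV and NPV of the observed label as a proxy for the true label.

    Assumed definitions (not taken from the paper):
      fp_rate = P(observed = 1 | true = 0)
      fn_rate = P(observed = 0 | true = 1)
      prevalence = P(true = 1)
    """
    observed_positive = prevalence * (1 - fn_rate) + (1 - prevalence) * fp_rate
    ppv = prevalence * (1 - fn_rate) / observed_positive
    npv = (1 - prevalence) * (1 - fp_rate) / (1 - observed_positive)
    return ppv, npv


def observed_auc(true_auc, prevalence, fp_rate, fn_rate):
    """AUC expected from standard ROC analysis on the misclassified labels."""
    ppv, npv = predictive_values(prevalence, fp_rate, fn_rate)
    return 0.5 + (true_auc - 0.5) * (ppv + npv - 1)


def corrected_auc(obs_auc, prevalence, fp_rate, fn_rate):
    """Invert the attenuation to recover the true AUC under the same assumptions."""
    ppv, npv = predictive_values(prevalence, fp_rate, fn_rate)
    attenuation = ppv + npv - 1
    if attenuation <= 0:
        raise ValueError("observed labels carry no usable information")
    return 0.5 + (obs_auc - 0.5) / attenuation


if __name__ == "__main__":
    # Illustrative values only; they are not results from the paper.
    biased = observed_auc(true_auc=0.80, prevalence=0.20, fp_rate=0.05, fn_rate=0.10)
    recovered = corrected_auc(biased, prevalence=0.20, fp_rate=0.05, fn_rate=0.10)
    print(f"observed AUC: {biased:.4f}, corrected AUC: {recovered:.4f}")
```

In this simplified setting the observed AUC is always pulled toward 0.5, and for fixed error rates the pull grows as the outcome becomes rarer, because even a small false positive rate then contaminates a large share of the observed cases.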

Keywords: ROC analysis; electronic health records; misclassification; precision medicine; risk prediction modeling.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Area Under Curve
  • Bias
  • Electronic Health Records
  • Hospitalization / statistics & numerical data
  • Humans
  • Models, Statistical*
  • ROC Curve*
  • Risk Assessment / methods
  • United States
  • United States Department of Veterans Affairs / statistics & numerical data