The effect of sample size and disease prevalence on supervised machine learning of narrative data

Proc AMIA Symp. 2002:519-22.

Authors

Lawrence K McKnight¹, Adam Wilcox, George Hripcsak

Affiliation

¹ Department of Medical Informatics, Columbia University, New York, NY, USA.

PMID: 12463878
PMCID: PMC2244149

Abstract

This paper examines the independent effects of outcome prevalence and training sample size on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Data sets were constructed to define positive outcome states at 5 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Work is needed to identify methods of improving classifier performance when output classes are rare.
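
The experimental design described above can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the abstract does not say how the simulated datasets were drawn, so sampling without replacement from separate pools of positive and negative reports is an assumption, and the names `sample_at_prevalence`, `positives`, and `negatives` are hypothetical. The sketch shows that 6 disease states crossed with the five listed prevalence rates and 2 training set sizes gives the 60 datasets, and how few positive cases a 200-case set holds at 1% prevalence.

```python
import random
from itertools import product

N_DISEASES = 6                                  # disease states, per the abstract
PREVALENCES = [0.01, 0.05, 0.10, 0.25, 0.50]    # prevalence rates, per the abstract
SIZES = [200, 2000]                             # training set sizes, per the abstract


def sample_at_prevalence(positives, negatives, size, prevalence, seed=0):
    """Draw `size` cases with a fixed fraction of positives.

    Assumption: cases are sampled without replacement from two pools.
    """
    rng = random.Random(seed)
    n_pos = round(size * prevalence)
    n_neg = size - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("pool too small for the requested size and prevalence")
    sample = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(sample)
    return sample


# One dataset per (disease, prevalence, size) combination.
design = list(product(range(N_DISEASES), PREVALENCES, SIZES))

# Number of positive cases available to the learner in each configuration.
positives_per_config = {
    (p, n): round(n * p) for p, n in product(PREVALENCES, SIZES)
}
```

At 1% prevalence a 200-case training set contains only 2 positive reports, and a 2,000-case set contains 20, which gives a concrete sense of how sparse the rare class becomes below the 10% threshold reported in the abstract.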

Publication types

Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms
Artificial Intelligence*
Humans
Lung / diagnostic imaging*
Lung Diseases / diagnostic imaging
Radiography
Sample Size
