Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules

Mizanur R Khondoker; Till T Bachmann; Muriel Mewissen; Paul Dickinson; Bartosz Dobrzelecki; Colin J Campbell; Andrew R Mount; Anthony J Walton; Jason Crain; Holger Schulze; Gerard Giraud; Alan J Ross; Ilenia Ciani; Stuart W J Ember; Chaker Tlili; Jonathan G Terry; Eilidh Grant; Nicola McDonnell; Peter Ghazal

doi:10.1142/s0219720010005063

J Bioinform Comput Biol. 2010 Dec;8(6):945-65. doi: 10.1142/s0219720010005063.

Affiliation

¹ Department of Biostatistics, Institute of Psychiatry and NIHR Biomedical Research Centre for Mental Health at the South London and Maudsley NHS Foundation Trust, King's College London, De Crespigny Park, London, UK. mizanur.khondoker@kcl.ac.uk

PMID: 21121020

Abstract

Machine learning and statistical model-based classifiers are increasingly used with complex, high-dimensional biological data obtained from high-throughput technologies. Understanding how the various factors associated with large and complex microarray datasets affect the predictive performance of classifiers is computationally intensive and under-investigated, yet vital for determining the optimal number of biomarkers for classification tasks aimed at improved detection, diagnosis, and therapeutic monitoring of diseases. We investigate the impact of microarray-based data characteristics on the predictive performance of various classification rules using simulation studies. Our investigation using Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour shows that the predictive performance of classifiers is strongly influenced by training set size, biological and technical variability, replication, fold change and correlation between biomarkers. The optimal number of biomarkers for a classification problem should therefore be estimated taking account of the impact of all these factors. A database of average generalization errors is built for various combinations of these factors. This database can be used to estimate the optimal number of biomarkers for a given level of predictive accuracy as a function of these factors. Examples show that error curves from actual biological data resemble those of simulated data with corresponding levels of data characteristics. An R package, optBiomarker, implementing the method is freely available for academic use from the Comprehensive R Archive Network (http://www.cran.r-project.org/web/packages/optBiomarker/).
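The simulation idea in the abstract can be illustrated with a minimal sketch. This is not the optBiomarker implementation and does not use its API. It assumes independent, normally distributed biomarkers on a log scale, a single variability term, and a nearest-centroid rule standing in for the classifiers named above; correlation, replication, and separate biological and technical variance are left out. The names `simulate_error` and `smallest_panel` and all parameter defaults are hypothetical.

```python
import random


def simulate_error(n_biomarkers, n_train=20, fold_change=1.0, sd=1.0,
                   n_test=50, n_reps=30, seed=0):
    """Average test error of a nearest-centroid rule on simulated two-class data.

    Assumption: each biomarker is independent, with mean 0 in class 0 and
    mean `fold_change` in class 1, and a common standard deviation `sd`.
    """
    rng = random.Random(seed)

    def draw(label):
        mean = fold_change if label == 1 else 0.0
        return [rng.gauss(mean, sd) for _ in range(n_biomarkers)]

    errors = 0
    total = 0
    for _ in range(n_reps):
        # Estimate one centroid per class from n_train samples each.
        centroids = []
        for label in (0, 1):
            samples = [draw(label) for _ in range(n_train)]
            centroids.append([sum(col) / n_train for col in zip(*samples)])
        # Classify fresh samples by squared Euclidean distance to each centroid.
        for label in (0, 1):
            for _ in range(n_test):
                x = draw(label)
                dists = [sum((a - c) ** 2 for a, c in zip(x, cen))
                         for cen in centroids]
                predicted = 0 if dists[0] <= dists[1] else 1
                errors += predicted != label
                total += 1
    return errors / total


def smallest_panel(candidates, target_error, **kwargs):
    """Smallest candidate panel size whose simulated error meets the target."""
    for m in sorted(candidates):
        if simulate_error(m, **kwargs) <= target_error:
            return m
    return None  # no candidate reached the target error


if __name__ == "__main__":
    for m in (1, 5, 10, 20):
        print(m, round(simulate_error(m), 3))
    print("smallest panel for 10% error:",
          smallest_panel([1, 2, 5, 10, 20, 50], target_error=0.10))
```

Repeating such a run over a grid of training set sizes, fold changes, and variability levels would give a small table of average errors, loosely analogous to the database of generalization errors the abstract describes.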

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence
Biomarkers* / blood
Classification / methods
Computational Biology*
Databases, Factual
Gene Expression Profiling / statistics & numerical data
Humans
Microarray Analysis / statistics & numerical data
Models, Statistical
Oligonucleotide Array Sequence Analysis / statistics & numerical data

Substances

Biomarkers
