Merits of random forests emerge in evaluation of chemometric classifiers by external validation

I M Scott; W Lin; M Liakata; J E Wood; C P Vermeer; D Allaway; J L Ward; J Draper; M H Beale; D I Corol; J M Baker; R D King

doi:10.1016/j.aca.2013.09.027

Anal Chim Acta. 2013 Nov 1:801:22-33. doi: 10.1016/j.aca.2013.09.027. Epub 2013 Sep 23.

Authors

I M Scott¹, W Lin, M Liakata, J E Wood, C P Vermeer, D Allaway, J L Ward, J Draper, M H Beale, D I Corol, J M Baker, R D King

Affiliation

¹ Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, SY23 3FG, UK. Electronic address: ias@aber.ac.uk.

PMID: 24139571

Abstract

Real-world applications will inevitably entail divergence between the samples on which chemometric classifiers are trained and the unknowns requiring classification. This has long been recognized, but there is a shortage of empirical studies on which classifiers perform best in 'external validation' (EV), where the unknown samples are subject to sources of variation relative to the population used to train the classifier. A survey of 286 classification studies in analytical chemistry found that only 6.6% stated elements of variance between training and test samples. Instead, most tested classifiers using hold-outs or resampling (usually cross-validation) from the same population used in training. The present study evaluated a wide range of classifiers on NMR and mass spectra of plant and food materials, from four projects with different data properties (e.g., different numbers and prevalence of classes) and classification objectives. Cross-validation was found to give optimistic performance estimates relative to EV on samples of different provenance from the training set (e.g., different genotypes, different growth conditions, different seasons of crop harvest). For classifier evaluations across the diverse tasks, we used rank-based non-parametric comparisons and permutation-based significance tests. Although latent variable methods (e.g., PLSDA) were used in 64% of the surveyed papers, they were among the less successful classifiers in EV, and orthogonal signal correction was counterproductive. Instead, the best EV performances were obtained with machine learning schemes that coped with the high dimensionality (914-1898 features). Random forests confirmed their resilience to high dimensionality as the best overall performers on the full data, despite being used in only 4.5% of the surveyed papers. Most other machine learning classifiers were improved by a feature selection filter (ReliefF), but still did not outperform random forests.
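The abstract mentions rank-based comparisons and permutation-based significance tests across tasks without giving the exact procedure. The following is a minimal sketch of one common way such an evaluation can be set up, not the authors' method. The classifier names (clf_a, clf_b, clf_c), the task names, and all accuracy values are hypothetical placeholders invented for illustration and are not results from the paper. Mean rank across tasks and an exact sign-flip permutation test are assumptions about what a comparison of this kind might look like.

```python
from itertools import product
from statistics import mean

# Hypothetical accuracies (NOT taken from the paper): one entry per task,
# each mapping an illustrative classifier name to its score on that task.
scores = {
    "task_1": {"clf_a": 0.90, "clf_b": 0.80, "clf_c": 0.70},
    "task_2": {"clf_a": 0.85, "clf_b": 0.85, "clf_c": 0.60},
    "task_3": {"clf_a": 0.75, "clf_b": 0.70, "clf_c": 0.72},
    "task_4": {"clf_a": 0.95, "clf_b": 0.90, "clf_c": 0.80},
}


def rank_within_task(task_scores):
    """Rank classifiers on one task (1 = best); tied scores share the average rank."""
    ordered = sorted(task_scores, key=task_scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of classifiers tied with the one at position i.
        while j + 1 < len(ordered) and task_scores[ordered[j + 1]] == task_scores[ordered[i]]:
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = average_rank
        i = j + 1
    return ranks


def mean_ranks(all_scores):
    """Average each classifier's within-task rank over all tasks."""
    per_task = [rank_within_task(s) for s in all_scores.values()]
    names = list(per_task[0])
    return {name: mean(r[name] for r in per_task) for name in names}


def sign_flip_p_value(all_scores, first, second):
    """Exact two-sided paired permutation test on the mean score difference.

    Under the null hypothesis that the two classifiers are exchangeable,
    the sign of each per-task difference is arbitrary, so every sign
    pattern is enumerated.
    """
    diffs = [s[first] - s[second] for s in all_scores.values()]
    observed = abs(mean(diffs))
    as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(mean(sign * d for sign, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:  # small tolerance for float rounding
            as_extreme += 1
    return as_extreme / total


ranks = mean_ranks(scores)
p_value = sign_flip_p_value(scores, "clf_a", "clf_c")
print(ranks)
print(p_value)
```

With only four tasks there are 16 sign patterns, so the smallest two-sided p-value this exact test can return is 2/16 = 0.125. A comparison over so few tasks therefore has limited power however consistent the differences are.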

Keywords: 9×CV; Classification; EV; External validation; FIE-MS; IID; LDA; Machine learning; OSC; PCA; PLSDA; Prediction; Random forest; ReliefF; SD; SIMCA; flow-injection electrospray-mass spectrometry; independent and identically distributed; linear discriminant analysis; nine-fold cross-validation; orthogonal signal correction; partial least squares discriminant analysis; principal component analysis; soft independent modeling of class analogy; standard deviation.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Arabidopsis / chemistry
Arabidopsis / classification
Arabidopsis / genetics
Arabidopsis / metabolism
Biomass
Cacao / chemistry
Cacao / classification
Cacao / genetics
Cacao / metabolism
Discriminant Analysis
Magnetic Resonance Spectroscopy*
Mass Spectrometry*
Metabolomics
Reproducibility of Results
Salicylic Acid / metabolism

Substances

Salicylic Acid
