The importance of choosing a proper validation strategy in predictive models. A tutorial with real examples

Eneko Lopez; Jaione Etxebarria-Elezgarai; Jose Manuel Amigo; Andreas Seifert

doi:10.1016/j.aca.2023.341532

Anal Chim Acta. 2023 Sep 22:1275:341532. doi: 10.1016/j.aca.2023.341532. Epub 2023 Jun 17.

Authors

Eneko Lopez¹, Jaione Etxebarria-Elezgarai², Jose Manuel Amigo³, Andreas Seifert⁴

Affiliations

¹ CIC NanoGUNE BRTA, Tolosa Hiribidea 76, San Sebastián, 20018, Spain; Department of Physics, University of the Basque Country (UPV/EHU), San Sebastián, 20018, Spain.
² CIC NanoGUNE BRTA, Tolosa Hiribidea 76, San Sebastián, 20018, Spain.
³ IKERBASQUE, Basque Foundation for Science, Plaza Euskadi, 5, Bilbao, 48009, Spain; Department of Analytical Chemistry, University of the Basque Country, Barrio Sarriena S/N, Leioa, 48940, Spain. Electronic address: josemanuel.amigo@ehu.eus.
⁴ CIC NanoGUNE BRTA, Tolosa Hiribidea 76, San Sebastián, 20018, Spain; IKERBASQUE, Basque Foundation for Science, Plaza Euskadi, 5, Bilbao, 48009, Spain. Electronic address: a.seifert@nanogune.eu.

PMID: 37524478
DOI: 10.1016/j.aca.2023.341532

Abstract

Machine learning is the art of combining a set of measurement data and predictive variables to forecast future events. New and increasingly sophisticated modelling approaches appear in the literature every day, yet far less attention is paid to the crucial stage of validation. Validation is the assessment of whether a model reliably links the measurements to the predictive variables. A model can be validated and cross-validated in many ways and appear reliable, and still misrepresent the real nature of the data and fail to predict external samples. This manuscript shows, in a didactic manner, how important the data structure is when a model is constructed, and how easy it is to obtain models that look promising when the cross-validation and external validation strategies are poorly designed. A comprehensive overview of the main validation strategies is given, illustrated by three different scenarios, all focused on classification.
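The point about data structure can be illustrated with a small, self-contained sketch. It is not taken from the paper: the synthetic data set, the function names (`make_data`, `nearest_label`, `cv_accuracy`) and the use of a 1-nearest-neighbour classifier in place of PLS-DA are all assumptions made for illustration. Each hypothetical subject contributes several replicate measurements, and the class labels are deliberately unrelated to the features, so an honest validation should report accuracy close to chance.

```python
import random

def make_data(n_subjects=40, n_replicates=5, n_features=10, seed=0):
    """Synthetic data: replicates cluster tightly around a per-subject centre.
    Labels alternate by subject and carry NO information about the features."""
    rng = random.Random(seed)
    X, y, groups = [], [], []
    for s in range(n_subjects):
        label = s % 2
        centre = [rng.gauss(0, 1) for _ in range(n_features)]
        for _ in range(n_replicates):
            X.append([c + rng.gauss(0, 0.1) for c in centre])
            y.append(label)
            groups.append(s)
    return X, y, groups

def nearest_label(x, train_X, train_y):
    """1-nearest-neighbour prediction (a stand-in classifier, not PLS-DA)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_X[i])),
    )
    return train_y[best]

def cv_accuracy(X, y, fold_of):
    """Cross-validated accuracy; fold_of[i] is the fold index of row i."""
    correct = 0
    for k in sorted(set(fold_of)):
        train = [i for i in range(len(X)) if fold_of[i] != k]
        test = [i for i in range(len(X)) if fold_of[i] == k]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(nearest_label(X[i], train_X, train_y) == y[i] for i in test)
    return correct / len(X)

X, y, groups = make_data()
n_folds = 5
rng = random.Random(1)

# Naive split: rows are shuffled individually, so replicates of one subject
# end up in both the training and the test folds.
rows = list(range(len(X)))
rng.shuffle(rows)
naive_fold = [0] * len(X)
for position, row in enumerate(rows):
    naive_fold[row] = position % n_folds

# Group-wise split: all replicates of a subject share the same fold.
subjects = sorted(set(groups))
rng.shuffle(subjects)
subject_fold = {s: position % n_folds for position, s in enumerate(subjects)}
grouped_fold = [subject_fold[g] for g in groups]

naive_acc = cv_accuracy(X, y, naive_fold)
grouped_acc = cv_accuracy(X, y, grouped_fold)
print(f"row-wise CV accuracy:   {naive_acc:.2f}")
print(f"group-wise CV accuracy: {grouped_acc:.2f}")
```

Under these assumptions, the row-wise split should report near-perfect accuracy, because each test row finds a replicate of its own subject in the training set, while the group-wise split should fall back to roughly chance level. The model and the data are identical in both cases; only the validation design differs.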

Keywords: Bootstrap; Cross-validation; Jackknife; PLS-DA; Permutation test; Resampling; Validation.

Publication types

Review