The importance of being external. Methodological insights for the external validation of machine learning models in medicine

Comput Methods Programs Biomed. 2021 Sep;208:106288. doi: 10.1016/j.cmpb.2021.106288. Epub 2021 Jul 22.

Abstract

Background and Objective: Medical machine learning (ML) models tend to perform better on data from their development cohort than on new data, often due to overfitting or covariate shift. For these reasons, external validation (EV) is a necessary practice in the evaluation of medical ML models. However, there is still a gap in the literature on how to interpret EV results and, hence, how to assess the robustness of ML models.

Methods: We fill this gap by proposing a meta-validation method to assess the soundness of EV procedures. In doing so, we complement the usual way of assessing EV by considering both the cardinality of the EV dataset and its similarity to the training set. We then investigate how the notions of cardinality and similarity can be used to gauge the reliability of a validation procedure, by integrating them into two summative data visualizations.
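
The abstract does not specify how cardinality and similarity are computed; the minimal sketch below assumes one plausible choice, taking similarity as 1 minus the mean feature-wise two-sample Kolmogorov-Smirnov statistic. The function names and the similarity metric are illustrative assumptions, not the authors' definitions.

```python
# Hypothetical sketch of the two meta-validation quantities: dataset
# cardinality and train/EV similarity. The similarity metric (1 minus
# the mean feature-wise Kolmogorov-Smirnov statistic) is an assumption
# made for illustration, not the paper's stated method.
import numpy as np
from scipy.stats import ks_2samp

def dataset_cardinality(ev: np.ndarray) -> int:
    """Cardinality: the number of samples in the EV set."""
    return ev.shape[0]

def dataset_similarity(train: np.ndarray, ev: np.ndarray) -> float:
    """Similarity in [0, 1]: 1 = identical marginal feature
    distributions, 0 = maximally different (assumed metric)."""
    ks_stats = [ks_2samp(train[:, j], ev[:, j]).statistic
                for j in range(train.shape[1])]
    return 1.0 - float(np.mean(ks_stats))
```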

Results: We illustrate our methodology by applying it to the validation of a state-of-the-art COVID-19 diagnostic model on 8 EV sets collected across 3 different continents. Model performance was moderately impacted by data similarity (Pearson ρ = 0.38, p < 0.001). In the EV, the validated model achieved good AUC (average: 0.84), acceptable calibration (average: 0.17) and utility (average: 0.50). The validation datasets were adequate in terms of cardinality and similarity, suggesting that the results are sound. We also provide a qualitative guideline for evaluating the reliability of validation procedures, and we discuss the importance of proper external validation in light of the obtained results.
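
As a hedged illustration of the three reported performance dimensions, the sketch below computes discrimination (AUC) and calibration with standard scikit-learn calls, and treats utility as the net benefit at a fixed decision threshold. The use of the Brier score for calibration and net benefit for utility are assumptions, since the abstract does not name the exact metrics behind the 0.17 and 0.50 averages.

```python
# Hypothetical sketch: the three performance dimensions on a single EV
# set. Brier score (calibration) and net benefit (utility) are assumed
# stand-ins; the abstract does not name the exact metrics used.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import brier_score_loss, roc_auc_score

def evaluate_ev_set(y_true, y_prob, threshold=0.5):
    """Return (discrimination, calibration, utility) for one EV set."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    auc = roc_auc_score(y_true, y_prob)        # discrimination
    brier = brier_score_loss(y_true, y_prob)   # calibration (lower = better)
    # Assumed utility: net benefit at decision threshold t,
    # NB = TP/n - FP/n * t / (1 - t)  (decision-curve analysis form).
    y_pred = (y_prob >= threshold).astype(int)
    n = len(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    net_benefit = tp / n - fp / n * threshold / (1 - threshold)
    return auc, brier, net_benefit

# Across the 8 EV sets, a similarity-performance association like the
# reported Pearson rho = 0.38 could be computed as:
#   rho, p = pearsonr(similarity_per_ev_set, auc_per_ev_set)
```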

Conclusions: In this paper, we propose a novel, lean methodology to: 1) study how the similarity between training and validation sets impacts the generalizability of an ML model; 2) assess the soundness of EV evaluations along three complementary performance dimensions: discrimination, utility and calibration; 3) draw conclusions on the robustness of the model under validation. We applied this methodology to a state-of-the-art model for the diagnosis of COVID-19 from routine blood tests, and showed how to interpret the results in light of the presented framework.

Keywords: COVID-19; Dataset cardinality; Dataset similarity; Medical machine learning; Validation.

MeSH terms

  • COVID-19*
  • Cohort Studies
  • Humans
  • Machine Learning
  • Reproducibility of Results
  • SARS-CoV-2