Data leakage inflates prediction performance in connectome-based machine learning models

Matthew Rosenblatt; Link Tejavibulya; Rongtao Jiang; Stephanie Noble; Dustin Scheinost

doi:10.1038/s41467-024-46150-w

Nat Commun. 2024 Feb 28;15(1):1829. doi: 10.1038/s41467-024-46150-w.

Authors

Matthew Rosenblatt¹, Link Tejavibulya², Rongtao Jiang³, Stephanie Noble³ ⁴ ⁵, Dustin Scheinost⁶ ² ³ ⁷ ⁸

Affiliations

¹ Department of Biomedical Engineering, Yale University, New Haven, CT, USA. matthew.rosenblatt@yale.edu.
² Interdepartmental Neuroscience Program, Yale University, New Haven, CT, USA.
³ Department of Radiology & Biomedical Imaging, Yale School of Medicine, New Haven, CT, USA.
⁴ Department of Bioengineering, Northeastern University, Boston, MA, USA.
⁵ Department of Psychology, Northeastern University, Boston, MA, USA.
⁶ Department of Biomedical Engineering, Yale University, New Haven, CT, USA.
⁷ Child Study Center, Yale School of Medicine, New Haven, CT, USA.
⁸ Department of Statistics & Data Science, Yale University, New Haven, CT, USA.

Abstract

Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but still pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects existing literature. Here, we investigate the effects of five forms of leakage (involving feature selection, covariate correction, and dependence between subjects) on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
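As an illustration of the feature-selection leakage described in the abstract, the sketch below compares selecting features on the full dataset before cross-validation (leaky) with selecting them inside each training fold (non-leaky). It is not the authors' pipeline: the data are synthetic Gaussian noise with no real brain-behavior relationship, and the function names (`top_k_features`, `ridge_fit_predict`, `cv_correlation`), the ridge penalty, the number of selected features, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def top_k_features(X, y, k):
    """Indices of the k features with the largest absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(r))[:k]


def ridge_fit_predict(X_train, y_train, X_test, alpha=10.0):
    """Closed-form ridge regression; centering uses training data only."""
    mu, y_mu = X_train.mean(axis=0), y_train.mean()
    A = X_train - mu
    w = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ (y_train - y_mu))
    return (X_test - mu) @ w + y_mu


def cv_correlation(X, y, k, leaky, n_folds=5, seed=0):
    """Correlation between cross-validated predictions and y.

    leaky=True  : features are chosen once using ALL subjects (test data leaks in).
    leaky=False : features are chosen separately within each training fold.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    global_idx = top_k_features(X, y, k) if leaky else None
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        idx = global_idx if leaky else top_k_features(X[train], y[train], k)
        preds[test] = ridge_fit_predict(X[train][:, idx], y[train], X[test][:, idx])
    return np.corrcoef(preds, y)[0, 1]


# Synthetic null data (assumption): 100 "subjects", 5000 "edges", random phenotype.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5000))
y = rng.standard_normal(100)

r_leaky = cv_correlation(X, y, k=50, leaky=True)
r_clean = cv_correlation(X, y, k=50, leaky=False)
print(f"leaky feature selection:     r = {r_leaky:.2f}")
print(f"selection inside each fold:  r = {r_clean:.2f}")
```

Because the phenotype here is pure noise, any apparent predictive performance in the leaky variant comes from the selection step having seen the test subjects. The same structural fix, keeping every data-dependent step inside the training fold, applies to the other forms of leakage the abstract lists, though their effect sizes in real connectome data are an empirical question this toy example does not address.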

MeSH terms

Brain / diagnostic imaging
Connectome* / methods
Humans
Machine Learning
Magnetic Resonance Imaging / methods
Neuroimaging / methods
Reproducibility of Results
