Instability of Variable-selection Algorithms Used to Identify True Predictors of an Outcome in Intermediate-dimension Epidemiologic Studies

Solène Cadiou; Rémy Slama

doi:10.1097/EDE.0000000000001340

Epidemiology. 2021 May 1;32(3):402-411. doi: 10.1097/EDE.0000000000001340.

Authors

Solène Cadiou¹, Rémy Slama

Affiliation

¹ From the Team of Environmental Epidemiology, IAB, Institute for Advanced Biosciences, Inserm, CNRS, CHU-Grenoble-Alpes, University Grenoble-Alpes, Grenoble, France.

PMID: 33652445

Abstract

Background: Machine-learning algorithms are increasingly used in epidemiology to identify true predictors of a health outcome when many potential predictors are measured. However, these algorithms can provide different outputs when repeatedly applied to the same dataset, which can compromise research reproducibility. We aimed to illustrate that commonly used algorithms are unstable and, using the example of the Least Absolute Shrinkage and Selection Operator (LASSO), that the choice of stabilization method is crucial.

Methods: In a simulation study, we tested the stability and performance of widely used machine-learning algorithms (LASSO, Elastic-Net, and Deletion-Substitution-Addition [DSA]). We then assessed the effectiveness of six methods to stabilize LASSO and their impact on performance. We assumed that a linear combination of factors drawn from a simulated set of 173 quantitative variables assessed in 1,301 subjects influenced a continuous health outcome to varying extents. We assessed model stability, sensitivity, and false discovery proportion.
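
The three evaluation criteria named above can be illustrated with a minimal Python sketch. Sensitivity and false discovery proportion follow their usual definitions. The stability measure shown here (mean pairwise Jaccard similarity between the variable sets selected on repeated runs) is an assumption made for illustration, since the abstract does not state which stability index was used. All function names and the toy selections are hypothetical.

```python
from itertools import combinations


def sensitivity(selected, true_predictors):
    """Share of true predictors that were selected."""
    selected, true_predictors = set(selected), set(true_predictors)
    if not true_predictors:
        return float("nan")
    return len(selected & true_predictors) / len(true_predictors)


def false_discovery_proportion(selected, true_predictors):
    """Share of selected variables that are not true predictors."""
    selected, true_predictors = set(selected), set(true_predictors)
    if not selected:
        return 0.0  # convention: no selection means no false discovery
    return len(selected - true_predictors) / len(selected)


def mean_pairwise_jaccard(selections):
    """Assumed stability index: 1.0 means every run selected the same set."""
    sets = [set(s) for s in selections]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0

    def jaccard(a, b):
        union = a | b
        return 1.0 if not union else len(a & b) / len(union)

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy example: variable indices selected by three runs on the same dataset.
true_set = {0, 1, 2, 3}
runs = [{0, 1, 5}, {0, 1, 2}, {0, 1, 5}]

print(sensitivity(runs[0], true_set))                 # 0.5
print(false_discovery_proportion(runs[0], true_set))  # 0.333...
print(mean_pairwise_jaccard(runs))                    # 0.666...
```

A stability value below 1 for runs on identical data is the kind of instability the study examines; sensitivity and false discovery proportion are then computed per run against the simulated truth.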

Results: All tested algorithms were unstable. For LASSO, stabilization methods improved stability without ensuring perfect stability, a finding confirmed by application to an exposome study. Stabilization methods also affected performance. Specifically, stabilization based on hyperparameter optimization, frequently implemented in epidemiology, increased the false discovery proportion dramatically when predictors explained a low share of outcome variability. In contrast, stabilization based on the stability selection procedure often decreased the false discovery proportion, while sometimes simultaneously lowering sensitivity.

Conclusions: The instability of machine-learning methods should concern epidemiologists relying on them for variable selection, as stabilizing a model can impact its performance. For LASSO, stabilization methods based on the stability selection procedure (rather than addressing prediction stability) should be preferred to identify true predictors.
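
As a rough illustration of what a stability selection procedure for LASSO can look like, the sketch below refits LASSO on random half-samples and keeps only the variables selected in a large share of the fits. It is a generic sketch, not the authors' implementation: the small coordinate-descent solver, the penalty `lam=0.3`, the 50 subsamples, the 0.8 selection threshold, and the toy data (200 subjects, 20 variables, 3 true predictors) are all assumptions chosen so the example runs quickly with NumPy alone.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # residual y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coefficient j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta


def stability_selection(X, y, lam, n_subsamples=50, threshold=0.8, seed=0):
    """Return indices selected in at least `threshold` of half-sample fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        Xs, ys = X[idx], y[idx]
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)  # standardize columns
        ys = ys - ys.mean()
        counts += lasso_cd(Xs, ys, lam) != 0
    freq = counts / n_subsamples
    return np.flatnonzero(freq >= threshold), freq


# Toy data (assumed): 3 true predictors among 20 independent variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
true_predictors = [0, 1, 2]
y = X[:, true_predictors].sum(axis=1) + rng.normal(size=200)

selected, freq = stability_selection(X, y, lam=0.3)
print("selected:", selected)
print("selection frequencies:", np.round(freq, 2))
```

Because the final set depends on selection frequencies aggregated over many subsamples rather than on a single fit, it tends to vary less between runs, and raising the threshold trades sensitivity for fewer false discoveries, which is consistent with the pattern reported above.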

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Computer Simulation
Epidemiologic Studies
Humans
Machine Learning*
Reproducibility of Results