Deselection of base-learners for statistical boosting - with an application to distributional regression

Annika Strömer; Christian Staerk; Nadja Klein; Leonie Weinhold; Stephanie Titze; Andreas Mayr

doi:10.1177/09622802211051088

Stat Methods Med Res. 2022 Feb;31(2):207-224. doi: 10.1177/09622802211051088. Epub 2021 Dec 9.

Authors

Annika Strömer¹, Christian Staerk¹, Nadja Klein², Leonie Weinhold¹, Stephanie Titze³, Andreas Mayr¹

Affiliations

¹ Department of Medical Biometrics, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Germany.
² Emmy Noether Research Group in Statistics and Data Science, Humboldt-Universität zu Berlin, Germany.
³ Department of Nephrology and Hypertension, FAU Erlangen-Nuremberg, Germany.

PMID: 34882438
DOI: 10.1177/09622802211051088

Abstract

We present a new procedure for enhanced variable selection for component-wise gradient boosting. Statistical boosting is a computational approach that emerged from machine learning and allows regression models to be fitted in the presence of high-dimensional data. Furthermore, the algorithm can lead to data-driven variable selection. In practice, however, the final models tend to include too many variables in some situations. This occurs particularly for low-dimensional data ($p < n$), where we observe a slow overfitting behavior of boosting. As a result, more variables are included in the final model without altering the prediction accuracy. Many of these false positives are incorporated with a small coefficient and therefore have a small impact, but lead to a larger model. We try to overcome this issue by giving the algorithm the chance to deselect base-learners with minor importance. We analyze the impact of the new approach on variable selection and prediction performance in comparison to alternative methods, including boosting with earlier stopping as well as twin boosting. We illustrate our approach with data from an ongoing cohort study of chronic kidney disease patients, where the most influential predictors for the health-related quality of life measure are selected in a distributional regression approach based on beta regression.
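
The deselection idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the deselection criterion, so the sketch assumes one plausible rule: a base-learner is dropped when its share of the total risk reduction falls below a threshold `tau`, and the model is then refitted with the remaining base-learners only. The function names, the squared-error loss, the simple linear base-learners and the default values of `n_iter`, `nu` and `tau` are all illustrative assumptions.

```python
import numpy as np


def componentwise_l2_boost(X, y, n_iter=200, nu=0.1, active=None):
    """Component-wise L2 boosting with one linear base-learner per column.

    Assumes no column of X is constant. Returns the intercept, the
    coefficient vector and the risk reduction attributed to each column.
    """
    n, p = X.shape
    active = list(range(p)) if active is None else list(active)
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                      # centre columns so the offset is mean(y)
    col_ss = (Xc ** 2).sum(axis=0)
    coef = np.zeros(p)
    risk_reduction = np.zeros(p)
    resid = y - y.mean()
    if not active:                       # nothing to boost on
        return y.mean(), coef, risk_reduction
    for _ in range(n_iter):
        # Least-squares fit of every active base-learner to the residuals.
        betas = Xc[:, active].T @ resid / col_ss[active]
        # Drop in residual sum of squares each learner could achieve.
        gains = betas ** 2 * col_ss[active]
        k = int(np.argmax(gains))
        j = active[k]
        old_risk = resid @ resid
        resid = resid - nu * betas[k] * Xc[:, j]   # damped update
        coef[j] += nu * betas[k]
        risk_reduction[j] += old_risk - resid @ resid
    intercept = y.mean() - x_mean @ coef
    return intercept, coef, risk_reduction


def boost_with_deselection(X, y, n_iter=200, nu=0.1, tau=0.01):
    """Boost, drop learners with a small risk-reduction share, then refit.

    The threshold rule (share < tau) is an assumption made for this sketch.
    """
    _, _, rr = componentwise_l2_boost(X, y, n_iter, nu)
    total = rr.sum()
    keep = [j for j in range(X.shape[1]) if total > 0 and rr[j] >= tau * total]
    intercept, coef, _ = componentwise_l2_boost(X, y, n_iter, nu, active=keep)
    return intercept, coef, keep
```

Calling `boost_with_deselection(X, y)` on a design matrix with a few informative columns and several noise columns returns the indices of the retained columns in `keep`; coefficients of deselected columns stay exactly zero after the refit, which is what makes the final model sparser than the initial boosting fit.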

Keywords: Beta regression; earlier stopping; generalized additive models for location, scale and shape; model-based boosting; variable selection.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Cohort Studies
Humans
Longitudinal Studies
Machine Learning
Quality of Life*