Use and misuse of random forest variable importance metrics in medicine: demonstrations through incident stroke prediction

Meredith L Wallace; Lucas Mentch; Bradley J Wheeler; Amanda L Tapia; Marc Richards; Siyu Zhou; Lixia Yi; Susan Redline; Daniel J Buysse

doi:10.1186/s12874-023-01965-x

BMC Med Res Methodol. 2023 Jun 19;23(1):144. doi: 10.1186/s12874-023-01965-x.

Authors

Meredith L Wallace¹ ², Lucas Mentch³, Bradley J Wheeler⁴, Amanda L Tapia⁵, Marc Richards³, Siyu Zhou³, Lixia Yi³, Susan Redline⁶, Daniel J Buysse⁵

Affiliations

¹ Department of Psychiatry, University of Pittsburgh, 3811 O'Hara Street, Pittsburgh, PA, 15231, USA. lotzmj@upmc.edu.
² Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA. lotzmj@upmc.edu.
³ Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA.
⁴ School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, USA.
⁵ Department of Psychiatry, University of Pittsburgh, 3811 O'Hara Street, Pittsburgh, PA, 15231, USA.
⁶ Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.

Abstract

Background: Machine learning tools such as random forests provide important opportunities for modeling large, complex modern data generated in medicine. Unfortunately, when it comes to understanding why machine learning models are predictive, applied research continues to rely on 'out of bag' (OOB) variable importance metrics (VIMPs) that are known within the statistics community to have considerable shortcomings. After explaining the limitations of OOB VIMPs - including bias towards correlated features and limited interpretability - we describe a modern approach called 'knockoff VIMPs' and explain its advantages.

Methods: We first evaluate current VIMP practices through an in-depth literature review of 50 recent random forest manuscripts. Next, we recommend organized and interpretable strategies for analysis with knockoff VIMPs, including computing them for groups of features and considering multiple model performance metrics. To demonstrate methods, we develop a random forest to predict 5-year incident stroke in the Sleep Heart Health Study and compare results based on OOB and knockoff VIMPs.

Results: Nearly all papers in the literature review contained substantial limitations in their use of VIMPs. In our demonstration, using OOB VIMPs for individual variables suggested two highly correlated lung function variables (forced expiratory volume, forced vital capacity) as the best predictors of incident stroke, followed by age and height. Using an organized analytic approach that considered knockoff VIMPs of both groups of features and individual features, the largest contributions to model sensitivity came from medications (especially cardiovascular) and measured medical risk factors, while the largest contributions to model specificity came from age, diastolic blood pressure, self-reported medical risk factors, polysomnography features, and pack-years of smoking. Thus, we reach very different conclusions about stroke risk factors using OOB VIMPs versus knockoff VIMPs.

Conclusions: The near-ubiquitous reliance on OOB VIMPs may provide misleading results for researchers who use such methods to guide their research. Given the rapid pace of scientific inquiry using machine learning, it is essential to bring modern knockoff VIMPs that are interpretable and unbiased into widespread applied practice to steer researchers using random forest machine learning toward more meaningful results.

Keywords: Feature importance; Knockoff variable importance; Polysomnography; Random forest; Sleep.
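
The Methods summary above mentions computing knockoff VIMPs for groups of features and judging them on more than one performance metric. The abstract does not say how the knockoff copies are generated, so the following is only a rough sketch under a stated assumption: each group is replaced by draws from its Gaussian conditional distribution given the remaining features, which keeps the correlation with those features but breaks any direct link to the outcome. The function names, the Gaussian assumption, and the `fit_predict` callback are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def gaussian_conditional_replacement(X, group, rng):
    """Replace the columns in `group` with draws from their Gaussian
    conditional distribution given the remaining columns.
    ASSUMPTION: features are treated as jointly Gaussian (illustration only)."""
    X = np.asarray(X, dtype=float)
    group = np.asarray(group)
    rest = np.setdiff1d(np.arange(X.shape[1]), group)

    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    s_gg = cov[np.ix_(group, group)]
    s_gr = cov[np.ix_(group, rest)]
    s_rr = cov[np.ix_(rest, rest)]

    # Linear regression coefficients of the group on the remaining features.
    coef = s_gr @ np.linalg.pinv(s_rr)
    cond_mean = mu[group] + (X[:, rest] - mu[rest]) @ coef.T
    cond_cov = s_gg - coef @ s_gr.T
    cond_cov = (cond_cov + cond_cov.T) / 2  # symmetrize against round-off

    noise = rng.multivariate_normal(
        np.zeros(len(group)), cond_cov, size=X.shape[0]
    )
    X_new = X.copy()
    X_new[:, group] = cond_mean + noise
    return X_new


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    return tp / (tp + fn), tn / (tn + fp)


def group_importance(fit_predict, X_train, y_train, X_test, y_test, group, rng):
    """Drop in sensitivity and specificity when `group` is replaced and the
    model is refit. `fit_predict(X_train, y_train, X_test)` is a hypothetical
    callback that trains any classifier and returns 0/1 test predictions."""
    base = sensitivity_specificity(
        y_test, fit_predict(X_train, y_train, X_test)
    )
    Xk_train = gaussian_conditional_replacement(X_train, group, rng)
    Xk_test = gaussian_conditional_replacement(X_test, group, rng)
    alt = sensitivity_specificity(
        y_test, fit_predict(Xk_train, y_train, Xk_test)
    )
    return base[0] - alt[0], base[1] - alt[1]
```

Any classifier can be supplied through `fit_predict`, for example a wrapper around a random forest. Reporting the two differences separately mirrors the abstract's distinction between contributions to sensitivity and contributions to specificity.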

Publication types

Review
Research Support, Non-U.S. Gov't
Research Support, N.I.H., Extramural

MeSH terms

Benchmarking
Humans
Machine Learning
Random Forest*
Sleep
Stroke* / diagnosis
Stroke* / epidemiology
