Inferring feature importance with uncertainties with application to large genotype data

Pål Vegard Johnsen; Inga Strümke; Mette Langaas; Andrew Thomas DeWan; Signe Riemer-Sørensen

doi:10.1371/journal.pcbi.1010963

PLoS Comput Biol. 2023 Mar 14;19(3):e1010963. doi: 10.1371/journal.pcbi.1010963. eCollection 2023 Mar.

Authors

Pål Vegard Johnsen¹,², Inga Strümke³,⁴, Mette Langaas², Andrew Thomas DeWan⁵, Signe Riemer-Sørensen¹

Affiliations

¹ SINTEF DIGITAL, Oslo, Norway.
² Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway.
³ Department of Engineering Cybernetics, Norwegian University of Science and Technology, Trondheim, Norway.
⁴ Department of Holistic Systems, SimulaMet, Oslo, Norway.
⁵ Department of Chronic Disease Epidemiology and Center for Perinatal, Pediatric and Environmental Epidemiology, Yale School of Public Health, New Haven, Connecticut, United States of America.

Abstract

Estimating feature importance, that is, the contribution of a feature to a prediction or to several predictions, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data-generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, Sub-SAGE has the advantage that it can be estimated without computationally expensive resampling. We argue that, for all model types, the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping, and we demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as on large genotype data, where it is used to infer feature importance with respect to obesity.
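
As a rough illustration of attaching a bootstrap uncertainty to a feature importance score, the sketch below resamples held-out rows with replacement and reports a percentile interval. It is not the Sub-SAGE estimator from the article: the `importance` function is a deliberately simple stand-in (the increase in mean squared error when a feature is replaced by its sample mean), and the fixed model `predict`, the synthetic data, and all function and parameter names are assumptions made for this example only.

```python
import numpy as np

def importance(predict, X, y, k):
    """Stand-in importance score (NOT Sub-SAGE): increase in mean squared
    error when feature k is replaced by its sample mean."""
    X_masked = X.copy()
    X_masked[:, k] = X[:, k].mean()
    base_loss = np.mean((y - predict(X)) ** 2)
    masked_loss = np.mean((y - predict(X_masked)) ** 2)
    return masked_loss - base_loss

def bootstrap_interval(predict, X, y, k, n_boot=500, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap interval for feature k,
    resampling rows of (X, y) with replacement; the model stays fixed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # one bootstrap sample of row indices
        scores[b] = importance(predict, X[idx], y[idx], k)
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return importance(predict, X, y, k), lower, upper

# Hypothetical synthetic data: only feature 0 influences the response.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(400, 2))
y = 2.0 * X[:, 0] + data_rng.normal(scale=0.5, size=400)

def predict(X):
    """Assumed already-fitted model; it ignores feature 1 entirely."""
    return 2.0 * X[:, 0]

est0, lo0, hi0 = bootstrap_interval(predict, X, y, k=0)
est1, lo1, hi1 = bootstrap_interval(predict, X, y, k=1)
print(f"feature 0: {est0:.3f} [{lo0:.3f}, {hi0:.3f}]")
print(f"feature 1: {est1:.3f} [{lo1:.3f}, {hi1:.3f}]")
```

In this toy setup the interval for feature 0 should lie well above zero, while feature 1, which the assumed model never uses, receives a score of exactly zero in every resample. Any other importance score could be passed through the same resampling loop.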

Copyright: © 2023 Johnsen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Genotyping Techniques*
Uncertainty*

Grants and funding

This research was funded by The Research Council of Norway (https://www.forskningsradet.no/en/), Grant 272402, Ph.D. Scholarship at SINTEF, including funding for a research stay abroad at Yale School of Public Health to PVJ. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.