Statistical learning for sparser fine-mapped polygenic models: The prediction of LDL-cholesterol

Carlo Maj; Christian Staerk; Oleg Borisov; Hannah Klinkhammer; Ming Wai Yeung; Peter Krawitz; Andreas Mayr

doi:10.1002/gepi.22495

Genet Epidemiol. 2022 Dec;46(8):589-603. doi: 10.1002/gepi.22495. Epub 2022 Aug 8.

Authors

Carlo Maj¹,², Christian Staerk³, Oleg Borisov¹, Hannah Klinkhammer¹,³, Ming Wai Yeung¹,⁴, Peter Krawitz¹, Andreas Mayr³

Affiliations

¹ Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany.
² Centre for Human Genetics, University of Marburg, Marburg, Germany.
³ Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany.
⁴ Department of Cardiology, University of Groningen, Groningen, The Netherlands.

PMID: 35938382
DOI: 10.1002/gepi.22495

Abstract

Polygenic risk scores quantify an individual's genetic predisposition to a particular trait. We propose and illustrate the application of existing statistical learning methods to derive sparser models for genome-wide data with a polygenic signal. Our approach is based on three consecutive steps. First, potentially informative loci are identified by a marginal screening approach. Then, fine-mapping is applied independently to blocks of variants in linkage disequilibrium, where informative variants are retrieved using variable selection methods, including boosting with probing and stochastic searches with the Adaptive Subspace method. Finally, joint prediction models with the selected variants are derived using statistical boosting. In contrast to alternative approaches relying on univariate summary statistics from genome-wide association studies, our three-step approach makes it possible to select and fit multivariable regression models on large-scale genotype data. Based on UK Biobank data, we develop prediction models for LDL-cholesterol as a continuous trait. Additionally, we consider a recent scalable algorithm for the Lasso. Results show that statistical learning approaches based on fine-mapping of genetic signals achieve competitive prediction performance compared to classical polygenic risk approaches, while yielding sparser risk models.

Keywords: UK Biobank; boosting; polygenic score; stochastic search; variable selection.
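
The three steps described in the abstract (marginal screening, block-wise fine-mapping, joint boosting) can be illustrated with a minimal sketch on simulated data. This is not the authors' implementation: the genotypes are simulated, the LD blocks are assumed to be fixed groups of consecutive variants, the screening threshold of 0.1 on the marginal correlation is arbitrary, and the helper names `standardize` and `boost` are hypothetical. Fine-mapping here uses only component-wise L2 boosting with probing, understood as adding permuted copies of each variable and stopping once one of them is selected; the Adaptive Subspace method and the Lasso are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_blocks, block_size = 1000, 20, 10
p = n_blocks * block_size

# Simulated 0/1/2 genotypes, correlated within each block (a crude stand-in for LD).
shared = np.repeat(rng.normal(size=(n, n_blocks)), block_size, axis=1)
latent = np.sqrt(0.6) * shared + np.sqrt(0.4) * rng.normal(size=(n, p))
G = (latent > -0.5).astype(float) + (latent > 0.8)

# Assumption: three causal variants, each in a different block, unit effects.
causal = np.array([3, 77, 142])
y = G[:, causal].sum(axis=1) + rng.normal(size=n)


def standardize(X):
    """Center and scale columns to mean 0 and variance 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def boost(X, y, n_iter=300, nu=0.1, probing=False, rng=None):
    """Component-wise L2 boosting on standardized columns.

    With probing=True, permuted copies (probes) of all columns are added and
    the loop stops as soon as a probe would be selected.
    """
    Z = standardize(X)
    n_obs, n_var = Z.shape
    if probing:
        probes = np.column_stack([rng.permutation(Z[:, j]) for j in range(n_var)])
        Z = np.hstack([Z, probes])
    beta = np.zeros(Z.shape[1])
    resid = y - y.mean()
    for _ in range(n_iter):
        slopes = Z.T @ resid / n_obs          # least-squares slope per column
        j = int(np.argmax(np.abs(slopes)))    # best-fitting single variable
        if probing and j >= n_var:
            break                             # first probe selected: stop
        beta[j] += nu * slopes[j]
        resid = resid - nu * slopes[j] * Z[:, j]
    return beta[:n_var], resid


# Step 1: marginal screening via the correlation of each variant with the trait.
marginal_r = standardize(G).T @ (y - y.mean()) / (n * y.std())
passed = np.abs(marginal_r) > 0.1

# Step 2: fine-mapping within each block that contains a screened variant.
selected = []
for b in range(n_blocks):
    cols = np.arange(b * block_size, (b + 1) * block_size)
    if passed[cols].any():
        beta_b, _ = boost(G[:, cols], y, probing=True, rng=rng)
        selected.extend(cols[beta_b != 0])
selected = np.array(selected, dtype=int)

# Step 3: joint multivariable model on the selected variants only.
beta, resid = boost(G[:, selected], y)
r2 = 1 - resid.var() / y.var()
print(f"{len(selected)} of {p} variants selected, in-sample R^2 = {r2:.2f}")
```

The coefficients returned by `boost` are on the standardized scale, and the reported R² is in-sample; an honest evaluation would use held-out individuals, standardized with the training means and standard deviations.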

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cholesterol, LDL / genetics
Genome-Wide Association Study* / methods
Humans
Models, Genetic
Multifactorial Inheritance / genetics
Polymorphism, Single Nucleotide*

Substances

Cholesterol, LDL