A multivariable approach for risk markers from pooled molecular data with only partial overlap

BMC Med Genet. 2019 Jul 19;20(1):128. doi: 10.1186/s12881-019-0849-0.

Abstract

Background: Molecular measurements from multiple studies are increasingly pooled to identify risk scores, although the studies often share only part of their measurements. In this setting, univariate analyses of such markers are routinely performed using meta-analysis techniques in genome-wide association studies to identify genetic risk scores. In contrast, multivariable techniques such as regularized regression, which are potentially more powerful, are hampered by the only partial overlap of available markers, even when pooling of individual-level data is feasible. This cannot easily be addressed at the preprocessing level, as quality criteria in the different studies may result in differential availability of markers, even after imputation.

Methods: Motivated by data from the InterLymph Consortium on risk factors for non-Hodgkin lymphoma, which exhibit these challenges, we adapted a regularized regression approach, componentwise boosting, to deal with partial overlap in single nucleotide polymorphisms (SNPs). This synthesis regression approach is combined with resampling to determine stable sets of SNPs, which could feed into a genetic risk score. The proposed approach is contrasted with univariate analyses, an application of the lasso, and an analysis that discards the studies causing the partial overlap. The question of statistical significance is addressed with an approach called stability selection.
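As a rough illustration of the idea described in the Methods, the following minimal sketch runs componentwise boosting on pooled data in which some markers are missing for some individuals, and then uses subsampling to estimate how often each marker is selected. It is not the authors' synthesis regression implementation. It assumes a continuous outcome with squared-error loss (the lymphoma application is case-control), marks unmeasured markers as NaN, and uses a squared score statistic as the selection criterion. The function names `componentwise_boost` and `inclusion_frequencies` and all parameter defaults are hypothetical.

```python
import numpy as np


def componentwise_boost(X, y, n_steps=50, nu=0.1):
    """Componentwise L2 boosting; NaN in X marks a marker that was not measured.

    Illustrative assumption: squared-error loss and a score-type selection
    criterion, not the likelihood-based procedure of the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(X)
    n_obs = observed.sum(axis=0)
    # Center each marker on its observed values; missing entries become 0,
    # so they contribute neither to the fit nor to the linear predictor.
    means = np.where(observed, X, 0.0).sum(axis=0) / np.maximum(n_obs, 1)
    Xc = np.where(observed, X - means, 0.0)
    sumsq = (Xc ** 2).sum(axis=0)
    beta = np.zeros(X.shape[1])
    resid = y - y.mean()
    for _ in range(n_steps):
        # Univariate least-squares fit of the residuals on every marker,
        # each using only the individuals for whom that marker is observed.
        cov = Xc.T @ resid
        est = np.divide(cov, sumsq, out=np.zeros_like(cov), where=sumsq > 0)
        score = cov * est  # equals cov**2 / sumsq, always >= 0
        j = int(np.argmax(score))
        if score[j] <= 0:
            break
        # Update only the best marker, shrunk by the step size nu.
        beta[j] += nu * est[j]
        resid = resid - nu * est[j] * Xc[:, j]
    return beta


def inclusion_frequencies(X, y, n_resamples=50, frac=0.632, seed=0, **boost_args):
    """Fraction of subsamples in which each marker gets a nonzero coefficient."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        # Draw a subsample of individuals without replacement.
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += componentwise_boost(X[idx], y[idx], **boost_args) != 0
    return counts / n_resamples
```

In this sketch, a marker that is missing for an entire study still enters the model through the individuals for whom it was measured, so no study has to be discarded. Markers with high inclusion frequencies across subsamples would be the candidates for a stable set.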

Results: Using an excerpt of the InterLymph Consortium data on two specific subtypes of non-Hodgkin lymphoma, we show that componentwise boosting can take into account all applicable information from different SNPs, irrespective of whether they are covered by all investigated studies and available for all individuals within the individual studies. The results indicate increased power, even when the studies that would be discarded in a complete case analysis comprise only a small proportion of individuals.

Conclusions: Given the observed gains in power, the proposed approach can be recommended more generally whenever there is only partial overlap of molecular measurements obtained from pooled studies and/or missing data within individual studies. A corresponding software implementation is available upon request.

Trial registration: All involved studies have provided signed GWAS data submission certifications to the U.S. National Institutes of Health and have been retrospectively registered.

Keywords: Consortium; Multivariable model; Partial overlap; Regularized regression; Single nucleotide polymorphism.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Analysis
  • Genome-Wide Association Study / methods*
  • Humans
  • Models, Theoretical*
  • Multivariate Analysis*
  • Polymorphism, Single Nucleotide
  • Regression Analysis
  • Research Design*
  • Risk Factors
  • Software