SNP variable selection by generalized graph domination

PLoS One. 2019 Jan 24;14(1):e0203242. doi: 10.1371/journal.pone.0203242. eCollection 2019.

Abstract

Background: High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p≫n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models.

Methods and findings: The k-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph, and (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; a pair of vertices is adjacent, i.e. joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then identified as the smallest subset such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples. C++ source code that implements SNP-SELECT and uses the Gurobi optimization solver for k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
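
The graph construction and selection rule described above can be illustrated with a minimal Python sketch. It is not the SNP-SELECT implementation: it assumes absolute Pearson correlation as the similarity measure, and it replaces the exact Gurobi-based optimization with a simple greedy heuristic, so the returned set is a valid k-dominating set but not necessarily a minimum one. All function names (pearson, build_snp_graph, greedy_k_dominating_set, is_k_dominating) are hypothetical and chosen for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def build_snp_graph(snps, threshold):
    """Build the similarity graph: one vertex per SNP, an edge when |r| exceeds the threshold.

    snps is a list of genotype vectors, one per SNP (assumed coding: 0/1/2 allele counts).
    Returns an adjacency dict mapping SNP index -> set of neighboring SNP indices.
    """
    adj = {i: set() for i in range(len(snps))}
    for i in range(len(snps)):
        for j in range(i + 1, len(snps)):
            if abs(pearson(snps[i], snps[j])) > threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def greedy_k_dominating_set(adj, k):
    """Greedy heuristic (assumption: stands in for the exact solver) for a k-dominating set."""
    selected = set()
    # need[v]: how many more selected neighbors v requires if it stays unselected
    need = {v: k for v in adj}

    def gain(v):
        # Selecting v clears v's own remaining need and helps each still-unsatisfied neighbor by one
        helped = sum(1 for u in adj[v] if u not in selected and need[u] > 0)
        return max(need[v], 0) + helped

    while any(v not in selected and need[v] > 0 for v in adj):
        best = max((v for v in adj if v not in selected), key=gain)
        selected.add(best)
        for u in adj[best]:
            need[u] -= 1
    return selected


def is_k_dominating(adj, selected, k):
    """Check the defining property: every unselected vertex has at least k selected neighbors."""
    return all(len(adj[v] & selected) >= k for v in adj if v not in selected)


# Toy data: SNP 0 and SNP 1 are highly correlated, SNP 2 is uncorrelated with both
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 1],
    [0, 0, 1, 2, 2, 1],
]
adj = build_snp_graph(snps, threshold=0.8)
chosen = greedy_k_dominating_set(adj, k=1)
print(sorted(chosen))
```

In this toy example one of the two correlated SNPs is kept as the representative of the pair, and the independent SNP is kept because it has no neighbors that could dominate it. Larger k keeps more representatives per correlated group, at the cost of a larger selected set.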

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Animals
  • Genome-Wide Association Study*
  • Linkage Disequilibrium*
  • Mice
  • Models, Genetic*
  • Pedigree*
  • Polymorphism, Single Nucleotide*

Grants and funding

This research is funded by the Oklahoma Wheat Research Foundation, OCAST (PS15-011) and NSF-MRI 1626257 (CC), NSF-IOS 1558109 (CC and PC), NSF-CMMI 1404971 (BB), and a fellowship from the Cornell Lab of Ornithology (BP). The work presented in this report also reflects the support from the USDA HATCH project OKL03011 (CC). SS, BR, YAE and CC acknowledge cash funding for this research from Genome Canada, Genome Alberta through Alberta Economic Trade and Development, Genome British Columbia, the University of Alberta and University of Calgary and others, including the Alberta forest industry in support of the Resilient Forests (RES-FOR): Climate, Pests & Policy - Genomic Applications project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.