SMaSH: Sample matching using SNPs in humans

Maximillian Westphal; David Frankhouser; Carmine Sonzone; Peter G Shields; Pearlly Yan; Ralf Bundschuh

doi:10.1186/s12864-019-6332-7


BMC Genomics. 2019 Dec 30;20(Suppl 12):1001. doi: 10.1186/s12864-019-6332-7.

Authors

Maximillian Westphal¹, David Frankhouser²,³, Carmine Sonzone⁴, Peter G Shields⁴,⁵,⁶, Pearlly Yan⁵,⁶, Ralf Bundschuh⁷,⁸,⁹,¹⁰,¹¹

Affiliations

¹ Interdisciplinary Biophysics Graduate Program, The Ohio State University, 484 W. 12th Avenue, Columbus, 43210, OH, USA.
² Biomedical Science Graduate Program, The Ohio State University, 333 W. 10th Avenue, Columbus, 43210, OH, USA.
³ Department of Diabetes Complications and Metabolism and Department of Population Sciences in the Beckman Research Institute, City of Hope, 1500 East Duarte Road, Duarte, 91010, CA, USA.
⁴ Molecular, Cellular, and Developmental Biology Graduate Program, The Ohio State University, 484 W. 12th Avenue, Columbus, 43210, OH, USA.
⁵ Department of Internal Medicine, The Ohio State University, 395 W. 12th Avenue, Columbus, 43210, OH, USA.
⁶ Comprehensive Cancer Center, The Ohio State University, 460 W. 10th Avenue, Columbus, 43210, OH, USA.
⁷ Interdisciplinary Biophysics Graduate Program, The Ohio State University, 484 W. 12th Avenue, Columbus, 43210, OH, USA. bundschuh@mps.ohio-state.edu.
⁸ Department of Internal Medicine, The Ohio State University, 395 W. 12th Avenue, Columbus, 43210, OH, USA. bundschuh@mps.ohio-state.edu.
⁹ Department of Physics, The Ohio State University, 191 W. Woodruff Avenue, Columbus, 43210, OH, USA. bundschuh@mps.ohio-state.edu.
¹⁰ Department of Chemistry and Biochemistry, The Ohio State University, 100 W. 18th Avenue, Columbus, 43210, OH, USA. bundschuh@mps.ohio-state.edu.
¹¹ Center for RNA Biology, The Ohio State University, 484 W. 12th Avenue, Columbus, 43210, OH, USA. bundschuh@mps.ohio-state.edu.

Abstract

Background: Inadvertent sample swaps are a real threat to data quality in any medium to large scale omics studies. While matches between samples from the same individual can in principle be identified from a few well characterized single nucleotide polymorphisms (SNPs), omics data types often only provide low to moderate coverage, thus requiring integration of evidence from a large number of SNPs to determine if two samples derive from the same individual or not.

Methods: We select about six thousand SNPs in the human genome and develop a Bayesian framework that is able to robustly identify sample matches between next generation sequencing data sets.

Results: We validate our approach on a variety of data sets. Most importantly, we show that our approach can establish identity between different omics data types such as Exome, RNA-Seq, and MethylCap-Seq. We demonstrate how identity detection degrades with sample quality and read coverage, but show that twenty million reads of a fairly low quality RNA-Seq sample are still sufficient for reliable sample identification.

Conclusion: Our tool, SMaSH, is able to identify sample mismatches in next generation sequencing data sets, both between different sequencing modalities and for low quality sequencing data.

Keywords: Identity matching; Next generation sequencing data; Sample swap.
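
Illustrative sketch: The abstract does not spell out the Bayesian framework, so the following Python sketch shows only the general idea of integrating weak per-SNP evidence into one match score. It is not the SMaSH implementation. Every modeling choice here is an assumption made for illustration: Hardy-Weinberg genotype priors, a fixed per-base error rate (`ERROR_RATE`), independent SNPs, and the hypothetical function names `genotype_priors`, `read_likelihoods`, and `log_bayes_factor`. Each sample is represented as a list of (reference read count, alternate read count) pairs, one pair per SNP.

```python
import math

ERROR_RATE = 0.01  # assumed per-base sequencing error rate (illustrative value)


def genotype_priors(alt_freq):
    """Hardy-Weinberg priors for 0, 1, or 2 copies of the alternate allele."""
    q = alt_freq
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def read_likelihoods(ref_count, alt_count, error_rate=ERROR_RATE):
    """P(reads | genotype) for genotypes with 0, 1, or 2 alternate alleles.

    The binomial coefficient is omitted because it cancels in the ratio
    computed by log_bayes_factor.
    """
    alt_probs = [error_rate, 0.5, 1.0 - error_rate]
    return [(a ** alt_count) * ((1.0 - a) ** ref_count) for a in alt_probs]


def log_bayes_factor(sample_a, sample_b, alt_freqs):
    """Sum over SNPs of log P(reads | same person) / P(reads | different people).

    Positive totals favor a match, negative totals favor a mismatch.
    Assumes every allele frequency lies strictly between 0 and 1.
    """
    total = 0.0
    for (ref_a, alt_a), (ref_b, alt_b), q in zip(sample_a, sample_b, alt_freqs):
        prior = genotype_priors(q)
        like_a = read_likelihoods(ref_a, alt_a)
        like_b = read_likelihoods(ref_b, alt_b)
        # Same individual: both read sets come from one shared genotype.
        same = sum(pg * la * lb for pg, la, lb in zip(prior, like_a, like_b))
        # Different individuals: genotypes are drawn independently.
        marg_a = sum(pg * la for pg, la in zip(prior, like_a))
        marg_b = sum(pg * lb for pg, lb in zip(prior, like_b))
        total += math.log(same) - math.log(marg_a * marg_b)
    return total


# Toy data: three SNPs, a handful of reads each.
freqs = [0.3, 0.3, 0.3]
exome = [(5, 0), (0, 5), (3, 3)]
rna_same = [(4, 0), (0, 6), (2, 2)]
rna_other = [(0, 5), (5, 0), (0, 4)]

print("consistent pair:", log_bayes_factor(exome, rna_same, freqs))
print("inconsistent pair:", log_bayes_factor(exome, rna_other, freqs))
```

Under these assumptions a SNP with no reads in one sample contributes nothing to the total, and each covered SNP shifts the score only slightly. This is why low to moderate coverage calls for evidence from a large number of SNPs rather than a few.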

MeSH terms

Bayes Theorem
Genome, Human / genetics
Genomics / methods*
High-Throughput Nucleotide Sequencing
Humans
Polymorphism, Single Nucleotide / genetics*
Reproducibility of Results
Sequence Analysis, DNA
Software*
