SM-RCNV: a statistical method to detect recurrent copy number variations in sequenced samples

Genes Genomics. 2019 May;41(5):529-536. doi: 10.1007/s13258-019-00788-9. Epub 2019 Feb 18.

Abstract

Background: Copy number variation (CNV) is an important form of genomic structural variation and is linked to dozens of human diseases. Developing computational methods that use next-generation sequencing (NGS) data to characterize such structural variants is important for understanding the mechanisms of diseases.

Objective: The objective of this study is to develop a new statistical method for detecting recurrent CNVs across multiple samples from genomic sequences.

Methods: A statistical method, referred to as SM-RCNV, is developed to detect recurrent CNVs. The method assigns each genomic location a statistic that combines the frequency of variation at that location across all samples with the correlation among consecutive locations. The weights of the frequency and correlation terms are trained using real datasets with known CNVs. A p-value is assessed for each location on the genome by permutation testing.
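The abstract does not give the exact form of the statistic, so the following is only a minimal sketch of the general idea under stated assumptions, not the SM-RCNV implementation. It assumes the input is a binary samples-by-locations matrix of variation calls, that "correlation among consecutive locations" can be approximated by the Pearson correlation between adjacent columns, and that the two terms are combined linearly. The function names, the weights `w_freq` and `w_corr` (fixed here rather than trained on known CNVs), and the within-sample shuffling scheme are all hypothetical.

```python
import numpy as np


def location_statistic(calls, w_freq=0.5, w_corr=0.5):
    """Per-location score: weighted frequency plus neighbour correlation.

    calls: (n_samples, n_locations) array of 0/1 variation calls (assumed input).
    w_freq, w_corr: illustrative fixed weights; the paper trains its weights.
    """
    calls = np.asarray(calls, dtype=float)
    freq = calls.mean(axis=0)  # fraction of samples varying at each location

    # Pearson correlation between each location and its right-hand neighbour.
    centered = calls - freq
    cov = (centered[:, :-1] * centered[:, 1:]).sum(axis=0)
    norm = np.sqrt((centered ** 2).sum(axis=0))
    denom = norm[:-1] * norm[1:]
    # Constant columns have zero variance; treat their correlation as 0.
    adj = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

    # Each location gets the mean correlation with its available neighbours.
    corr = np.zeros_like(freq)
    corr[:-1] += adj
    corr[1:] += adj
    corr[1:-1] /= 2.0

    return w_freq * freq + w_corr * corr


def permutation_pvalues(calls, n_perm=200, seed=0, w_freq=0.5, w_corr=0.5):
    """Empirical p-value per location from within-sample shuffles (assumed null)."""
    rng = np.random.default_rng(seed)
    calls = np.asarray(calls)
    observed = location_statistic(calls, w_freq, w_corr)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # Shuffle locations independently in every sample, which breaks
        # both recurrence across samples and adjacency along the genome.
        permuted = np.stack([rng.permutation(row) for row in calls])
        exceed += location_statistic(permuted, w_freq, w_corr) >= observed
    # Add-one correction keeps every p-value strictly positive.
    return (exceed + 1.0) / (n_perm + 1.0)


if __name__ == "__main__":
    # Toy data: 20 samples, 50 locations, one block shared by 15 samples.
    toy = np.zeros((20, 50), dtype=int)
    toy[:15, 20:25] = 1
    pvals = permutation_pvalues(toy)
    print(np.flatnonzero(pvals < 0.05))
```

In this sketch the smallest attainable p-value is 1/(n_perm + 1), so the number of permutations bounds the resolution of the test. A real analysis would also need a multiple-testing correction across locations, which the abstract does not describe.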

Results: SM-RCNV outperforms six peer methods in comparisons of receiver operating characteristic (ROC) curves. SM-RCNV successfully identifies many consistent recurrent CNVs, most of which are known to be of biological significance and associated with disease genes. The validation rates of SM-RCNV on the CEU call set and the YRI call set against the Database of Genomic Variants are 258/328 (79%) and 157/309 (51%), respectively.

Conclusion: SM-RCNV is a well-grounded statistical framework for detecting recurrent CNVs from multiple genomic sequences, providing valuable information to study genomes in human diseases. The source code is freely available at https://sourceforge.net/projects/sm-rcnv/.

Keywords: Correlation; Permutation test; Read depth; Recurrent copy number variations.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Sequence / genetics
  • Computer Simulation
  • DNA Copy Number Variations / genetics*
  • Data Interpretation, Statistical
  • Genome, Human / genetics
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • ROC Curve
  • Sequence Analysis, DNA / methods*
  • Sequence Analysis, DNA / statistics & numerical data
  • Software