Preprocessing Sequence Coverage Data for More Precise Detection of Copy Number Variations

IEEE/ACM Trans Comput Biol Bioinform. 2020 May-Jun;17(3):868-876. doi: 10.1109/TCBB.2018.2869738. Epub 2018 Sep 12.

Abstract

Copy number variation (CNV) is a type of genomic/genetic variation that plays an important role in phenotypic diversity, evolution, and disease susceptibility. Next generation sequencing (NGS) technologies have created an opportunity for more accurate detection of CNVs with higher resolution. However, efficient and precise detection of CNVs remains challenging due to high levels of noise and biases, data heterogeneity, and the "big data" nature of NGS data. Sequence coverage (readcount) data are mostly used for detecting CNVs, especially for whole exome sequencing data. Readcount data are contaminated with several types of biases and noise that hinder accurate detection of CNVs. In this work, we introduce a novel preprocessing pipeline for reducing noise and biases to improve the detection accuracy of CNVs in heterogeneous NGS data, such as cancer whole exome sequencing data. We have employed several normalization methods to reduce biases in readcount data that arise from the GC content of reads, read alignment problems, and sample impurity. We have also developed a novel, efficient, and effective smoothing approach based on Taut String to reduce noise and increase CNV detection power. Using simulated and real data, we showed that employing the proposed preprocessing pipeline significantly improves the accuracy of CNV detection.
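
The abstract does not specify which GC-content normalization the authors used. As an illustration only, the sketch below shows one commonly used approach, median-ratio correction, in which each bin's readcount is scaled by the ratio of the overall median readcount to the median readcount of bins with similar GC content. The function name `gc_normalize`, its parameters, and the stratum width are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(readcounts, gc_fractions, n_strata=100):
    """Median-ratio GC correction (illustrative; not the paper's exact method).

    readcounts   -- per-bin (e.g. per-exon) read counts
    gc_fractions -- per-bin GC fraction in [0, 1]
    n_strata     -- number of GC strata (hypothetical default: 1% wide strata)
    """
    if len(readcounts) != len(gc_fractions):
        raise ValueError("readcounts and gc_fractions must have equal length")

    # Assign each bin to a GC stratum; clamp GC == 1.0 into the last stratum.
    strata = [min(int(gc * n_strata), n_strata - 1) for gc in gc_fractions]

    # Collect the readcounts observed in each stratum.
    by_stratum = defaultdict(list)
    for rc, s in zip(readcounts, strata):
        by_stratum[s].append(rc)

    overall = median(readcounts)
    stratum_median = {s: median(v) for s, v in by_stratum.items()}

    corrected = []
    for rc, s in zip(readcounts, strata):
        m = stratum_median[s]
        # Leave the bin unchanged if its stratum median is zero.
        corrected.append(rc * overall / m if m > 0 else float(rc))
    return corrected
```

For example, calling `gc_normalize([100, 110, 200, 220], [0.30, 0.30, 0.60, 0.60])` rescales the low-GC and high-GC bins so that both groups share the same median coverage, removing the coverage difference that tracks GC content while preserving variation within each group.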

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • DNA Copy Number Variations / genetics*
  • Exome Sequencing / methods*
  • Genome, Human / genetics
  • Genomics / methods*
  • Humans
  • Neoplasms / genetics
  • Signal Processing, Computer-Assisted