SoloDel: a probabilistic model for detecting low-frequent somatic deletions from unmatched sequencing data

Bioinformatics. 2015 Oct 1;31(19):3105-13. doi: 10.1093/bioinformatics/btv358. Epub 2015 Jun 11.

Abstract

Motivation: Finding somatic mutations from massively parallel sequencing data is becoming a standard process in genome-based biomedical studies. A number of robust methods have been developed for detecting somatic single nucleotide variations. However, detection of somatic copy number alterations has been substantially less explored and remains vulnerable to two frequently raised sampling issues: low frequency in the cell population and the absence of matched control samples.

Results: We developed a novel computational method, SoloDel, that accurately distinguishes low-frequency somatic deletions from germline ones with or without matched control samples. We first constructed a probabilistic somatic mutation progression model that describes the occurrence and propagation of the event in the cellular lineage of the sample. We then built a Gaussian mixture model to represent the mixed population of somatic and germline deletions. Parameters of the mixture model can be estimated using the expectation-maximization algorithm from the observed distribution of read-depth ratios at the sites of the initial deletion calls, which are based on discordant reads. Combined with a conventional structural variation caller, SoloDel greatly increased the accuracy of classifying somatic mutations. Even without a control, SoloDel maintained comparable performance across a wide range of mutated subpopulation sizes (10-70%). SoloDel also successfully recalled experimentally validated somatic deletions from previously reported neuropsychiatric whole-genome sequencing data.
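
The mixture-model step described above can be illustrated with a minimal sketch. This is not SoloDel's implementation (which is Java-based and also includes the mutation progression model). It is a generic two-component, one-dimensional Gaussian mixture fitted by expectation-maximization to a list of read-depth ratios. The function names, the quartile-based initialisation, and the reading of the higher-mean component as the candidate somatic group (on the assumption that a deletion present in only part of the cell population removes less read depth than a germline one) are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(ratios, n_iter=200, min_sd=1e-3):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative only).

    Returns (weights, means, sds), each a list of length 2.
    """
    xs = sorted(ratios)
    n = len(xs)
    # Assumed initialisation: quartiles as starting means, pooled spread as sd.
    means = [xs[n // 4], xs[(3 * n) // 4]]
    overall_mean = sum(xs) / n
    pooled_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in xs) / n)
    sds = [max(min_sd, pooled_sd)] * 2
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([0.5, 0.5] if total == 0.0 else [p[0] / total, p[1] / total])

        # M-step: re-estimate weight, mean and sd of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component has collapsed; keep its old parameters
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(min_sd, math.sqrt(var))

    return weights, means, sds


def posterior_higher_component(x, weights, means, sds):
    """Posterior probability that x belongs to the higher-mean component."""
    hi = 0 if means[0] > means[1] else 1
    p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
    total = p[0] + p[1]
    return 0.5 if total == 0.0 else p[hi] / total


if __name__ == "__main__":
    # Synthetic read-depth ratios, invented purely for demonstration.
    random.seed(1)
    demo = [random.gauss(0.5, 0.04) for _ in range(200)]
    demo += [random.gauss(0.85, 0.04) for _ in range(100)]
    w, m, s = fit_two_component_gmm(demo)
    print("weights:", w)
    print("means:", m)
    print("P(higher component | ratio=0.8):", posterior_higher_component(0.8, w, m, s))
```

In this sketch each candidate deletion receives a posterior probability rather than a hard label, so a downstream threshold can be chosen to trade precision against recall. The demonstration data are synthetic and do not reproduce any result from the paper.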

Availability and implementation: Java-based implementation of the method is available at http://sourceforge.net/projects/solodel/

Contact: swkim@yuhs.ac or dhlee@biosoft.kaist.ac.kr

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computer Simulation
  • Databases, Genetic
  • Humans
  • Mental Disorders / genetics
  • Models, Statistical*
  • Reproducibility of Results
  • Sequence Analysis, DNA / methods*
  • Sequence Deletion / genetics*
  • Software*