Understanding the limitations of next generation sequencing informatics, an approach to clinical pipeline validation using artificial data sets

Robert Daber; Shrey Sukhadia; Jennifer J D Morrissette

doi:10.1016/j.cancergen.2013.11.005

Cancer Genet. 2013 Dec;206(12):441-8. doi: 10.1016/j.cancergen.2013.11.005. Epub 2013 Nov 28.

Authors

Robert Daber¹, Shrey Sukhadia², Jennifer J D Morrissette²

Affiliations

¹ Center for Personalized Diagnostics, University of Pennsylvania School of Medicine, Philadelphia, PA. Electronic address: Robert.Daber@uphs.upenn.edu.
² Center for Personalized Diagnostics, University of Pennsylvania School of Medicine, Philadelphia, PA.

PMID: 24528889
DOI: 10.1016/j.cancergen.2013.11.005

Abstract

The advantages of massively parallel sequencing are quickly being realized through the adoption of comprehensive genomic panels across the spectrum of genetic testing. Despite such widespread utilization of next generation sequencing (NGS), a major bottleneck in the implementation and capitalization of this technology remains in the data processing steps, or bioinformatics. Here we describe our approach to defining the limitations of each step in the data processing pipeline by utilizing artificial amplicon data sets to simulate a wide spectrum of genomic alterations. Through this process, we identified limitations of insertion, deletion (indel), and single nucleotide variant (SNV) detection using standard approaches and described novel strategies to improve overall somatic mutation detection. Using these artificial data sets, we were able to demonstrate that NGS assays can have robust mutation detection if the data can be processed in a way that does not lead to large genomic alterations landing in the unmapped data (i.e., trash). By using these pipeline modifications and a new variant caller, AbsoluteVar, we have been able to validate SNV mutation detection to 100% sensitivity and specificity with an allele frequency as low as 4% and detection of indels as large as 90 bp. Clinical validation of NGS relies on the ability to detect mutations across a wide array of genetic anomalies, and the utility of artificial data sets demonstrates a mechanism to intelligently test a vast array of mutation types.
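The abstract does not say how the artificial amplicon data sets were constructed, so the following is only a minimal sketch of the general idea, under stated assumptions: a reference amplicon is mutated in silico (SNV, deletion, or insertion) and mixed with wild-type copies at a chosen allele fraction, producing reads whose true variant content is known in advance. All function names, the toy reference sequence, and the constant quality string are hypothetical and are not taken from the paper or from AbsoluteVar.

```python
# Illustrative sketch only: every name here is hypothetical, and the
# construction is an assumption, not the authors' published method.

def apply_snv(ref, pos, alt):
    """Return ref with the base at 0-based position pos replaced by alt."""
    if ref[pos] == alt:
        raise ValueError("alt base must differ from the reference base")
    return ref[:pos] + alt + ref[pos + 1:]


def apply_deletion(ref, start, length):
    """Return ref with `length` bases removed, starting at 0-based `start`."""
    return ref[:start] + ref[start + length:]


def apply_insertion(ref, pos, inserted):
    """Return ref with `inserted` placed immediately before 0-based `pos`."""
    return ref[:pos] + inserted + ref[pos:]


def make_artificial_reads(ref, mutant, depth, allele_fraction):
    """Build `depth` full-length amplicon reads, with
    round(depth * allele_fraction) of them carrying the mutant sequence."""
    n_mutant = round(depth * allele_fraction)
    return [mutant] * n_mutant + [ref] * (depth - n_mutant)


def to_fastq(reads, prefix="sim", qual_char="I"):
    """Serialize reads as FASTQ text using a constant quality character."""
    records = []
    for i, seq in enumerate(reads):
        records.append(f"@{prefix}_{i}\n{seq}\n+\n{qual_char * len(seq)}\n")
    return "".join(records)


# Toy 20 bp reference amplicon (hypothetical, not a real locus).
reference = "ACGTACGTACGTACGTACGT"

# An SNV present in 4% of 1000 reads, mirroring the allele frequency
# quoted in the abstract.
snv_reads = make_artificial_reads(
    reference, apply_snv(reference, 5, "T"), depth=1000, allele_fraction=0.04
)

# An 8 bp deletion at 50% allele fraction.
del_reads = make_artificial_reads(
    reference, apply_deletion(reference, 4, 8), depth=1000, allele_fraction=0.5
)

fastq_text = to_fastq(snv_reads)
```

Because the variant and its allele fraction are fixed by construction, a file like this could be run through an alignment and variant-calling pipeline and the calls compared against the known truth, which is the kind of sensitivity and specificity check the abstract describes.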

Keywords: Next generation sequencing; artificial data set; bioinformatics; sensitivity; validation.

Publication types

Review

MeSH terms

Data Collection
High-Throughput Nucleotide Sequencing / methods*
Humans
Informatics / methods
Sensitivity and Specificity
Sequence Analysis, DNA / methods*