Using Machine Learning to Identify True Somatic Variants from Next-Generation Sequencing

Clin Chem. 2020 Jan 1;66(1):239-246. doi: 10.1373/clinchem.2019.308213.

Abstract

Background: Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning-based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing from nonformalin-fixed paraffin-embedded tumor specimens.

Methods: A cohort of 11278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of the clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of each classification were derived during the process and used to label "uncertain" variants.
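As a rough illustration of how a prediction interval can drive a three-way label, the sketch below assumes each SNV comes with a lower and upper bound on its predicted probability of being real. The function name, the 0.5 cutoff, and the decision rule are hypothetical and are not taken from the study, which does not specify these details in the abstract.

```python
def label_variant(lower: float, upper: float, cutoff: float = 0.5) -> str:
    """Assign a 3-class label from a prediction interval (hypothetical rule).

    lower, upper: assumed bounds on the predicted probability that the SNV is real.
    cutoff: assumed decision threshold; 0.5 is an illustrative default only.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if lower > cutoff:
        # The whole interval sits above the cutoff: confident "real" call.
        return "real"
    if upper < cutoff:
        # The whole interval sits below the cutoff: confident "artifact" call.
        return "artifact"
    # The interval straddles the cutoff: defer to manual review.
    return "uncertain"


labels = [label_variant(lo, hi) for lo, hi in [(0.80, 0.95), (0.02, 0.10), (0.40, 0.70)]]
```

Under this assumed rule, only variants whose interval straddles the cutoff would be routed to manual review.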

Results: The optimized classifier demonstrated 100% specificity and 97% sensitivity over 5587 SNVs of the test set. Overall, 1252 of 1341 true-positive variants were identified as real and 4143 of 4246 false-positive calls were deemed artifacts. Only 192 (3.4%) SNVs were labeled as "uncertain," with zero misclassification between the true positives and artifacts in the test set.
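The reported counts can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated above and relies on the stated absence of misclassification, so every SNV without a definitive label is counted as "uncertain"; the variable names are illustrative.

```python
# Counts as reported for the test set.
true_pos_total, true_pos_called_real = 1341, 1252
artifact_total, artifact_called_artifact = 4246, 4143

# Total SNVs in the test set.
test_total = true_pos_total + artifact_total

# With zero misclassification, the remainder in each group is "uncertain".
uncertain = (true_pos_total - true_pos_called_real) + (
    artifact_total - artifact_called_artifact
)

uncertain_pct = round(100 * uncertain / test_total, 1)
definitive_pct = round(100 - uncertain_pct, 1)

print(test_total, uncertain, uncertain_pct, definitive_pct)
```

The totals reproduce the 5587 test-set SNVs, the 192 "uncertain" calls (3.4%), and the 96.6% of SNVs that received definitive labels.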

Conclusions: We present a computational classifier to identify artifactual variant calls from tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and thus were exempt from manual review. This framework could improve the quality and efficiency of the variant review process in clinical laboratories.

MeSH terms

  • False Positive Reactions
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Machine Learning*
  • Neoplasms / diagnosis
  • Neoplasms / genetics
  • Polymorphism, Single Nucleotide
  • Sensitivity and Specificity