A novel machine learning approach (svmSomatic) to distinguish somatic and germline mutations using next-generation sequencing data

Zool Res. 2021 Mar 18;42(2):246-249. doi: 10.24272/j.issn.2095-8137.2021.014.

Abstract

Somatic mutations are a large category of genetic variation and play an essential role in tumorigenesis. Detection of somatic single nucleotide variants (SNVs) could facilitate downstream analysis of tumorigenesis. Many computational methods have been developed to detect SNVs, but most require matched normal samples to differentiate somatic SNVs from the normal state, and such samples can be difficult to obtain. Therefore, developing new approaches for detecting somatic SNVs without matched samples is crucial. In this work, we detected somatic mutations from individual tumor samples using a novel machine learning approach, svmSomatic, applied to next-generation sequencing (NGS) data. Because somatic SNV detection can be affected by the presence of other mutation types, with germline mutations and co-occurring copy number variations (CNVs) being common in organisms, we used the approach to distinguish somatic from germline mutations based on NGS data from individual tumor samples. In summary, svmSomatic: (1) considers the influence of CNV co-occurrence in detecting somatic mutations; and (2) trains a support vector machine algorithm to distinguish between somatic and germline mutations, without requiring matched normal samples. We further tested svmSomatic and compared it with other common methods. Results showed that svmSomatic performance, as measured by F1-score, was significantly better than that of the other methods on both simulated and real NGS data.
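To illustrate why CNV co-occurrence matters when separating somatic from germline SNVs in a tumor-only sample, the sketch below computes the expected variant allele frequency (VAF) under a commonly used purity and copy-number mixture model. This is not taken from svmSomatic: the abstract does not describe the features it uses, and the function name `expected_vaf`, its parameters, and the assumption of diploid normal cells are all hypothetical choices made for this example.

```python
def expected_vaf(purity, tumor_cn, mutant_copies_tumor, mutant_copies_normal):
    """Expected VAF at a site in a tumor sample contaminated by normal cells.

    Assumptions (illustrative, not from the paper):
      - normal cells are diploid (2 copies) at the site
      - purity is the fraction of tumor cells in the sample
      - tumor_cn is the total copy number at the site in tumor cells
      - mutant_copies_* is the number of copies carrying the variant
        in a tumor cell / normal cell
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if tumor_cn < 0 or mutant_copies_tumor < 0 or mutant_copies_normal < 0:
        raise ValueError("copy numbers must be non-negative")
    if mutant_copies_tumor > tumor_cn or mutant_copies_normal > 2:
        raise ValueError("mutant copies cannot exceed total copies")

    total_copies = purity * tumor_cn + (1.0 - purity) * 2.0
    if total_copies == 0:
        raise ValueError("no copies of the site are present in the sample")
    mutant_copies = (purity * mutant_copies_tumor
                     + (1.0 - purity) * mutant_copies_normal)
    return mutant_copies / total_copies


# Heterozygous germline SNV in a copy-neutral region: one mutant copy
# in every cell, tumor or normal.
germline_neutral = expected_vaf(0.6, 2, 1, 1)

# Clonal somatic SNV in a copy-neutral region: absent from normal cells.
somatic_neutral = expected_vaf(0.6, 2, 1, 0)

# Heterozygous germline SNV where the tumor gained a copy of the variant
# allele (3 copies in total, 2 of them mutant).
germline_gain = expected_vaf(0.6, 3, 2, 1)
```

Under these assumptions, a germline variant in a copy-neutral region sits near a VAF of 0.5, while a copy gain or loss shifts it away from 0.5 and can make it resemble a somatic variant, which is one reason a classifier might take local copy number into account.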

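Somatic mutations are a major type of variation in cancer genomes and are closely linked to the initiation and progression of tumors. Detection of single nucleotide variants (SNVs) can facilitate downstream analysis in tumor research. Many methods are available for detecting SNVs, but most require a normal sample matched to the cancer sample in order to identify somatic variants, and such matched normal samples are often difficult to obtain. Developing new methods to detect somatic variants from single tumor samples is therefore crucial. In this work, we developed a new machine learning method for accurately detecting somatic mutations in next-generation sequencing data from individual tumor samples. Another consideration in somatic variant detection is the coexistence of multiple variant types: co-occurrence of copy number variations (CNVs) and SNVs within tumor cells is common. We therefore propose a new machine learning model, svmSomatic, which can distinguish somatic mutations from germline mutations in the genomic data of an individual tumor sample. The new features of svmSomatic are that it: (1) considers the influence of CNVs on the detection of somatic variants; and (2) uses a trained support vector machine (SVM) as the classifier to distinguish somatic from germline variants in single-tumor-sample data. We tested svmSomatic on simulated and real genomic data and compared it with other similar methods. The simulations and comparisons show that, as evaluated by F1-score, svmSomatic performed better than the other methods on both simulated and real data.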
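Because the comparison above is reported as an F1-score, a minimal sketch of how that metric could be computed for somatic-versus-germline calls is given here. The labels, the toy call set, and the function name `f1_score` are hypothetical and are not data or code from the study.

```python
def f1_score(true_labels, predicted_labels, positive="somatic"):
    """F1-score (harmonic mean of precision and recall) for one class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    tp = fp = fn = 0
    for truth, call in zip(true_labels, predicted_labels):
        if call == positive and truth == positive:
            tp += 1  # somatic site correctly called somatic
        elif call == positive and truth != positive:
            fp += 1  # germline site wrongly called somatic
        elif call != positive and truth == positive:
            fn += 1  # somatic site missed

    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels for five variant sites.
truth = ["somatic", "somatic", "germline", "germline", "somatic"]
calls = ["somatic", "germline", "germline", "somatic", "somatic"]
score = f1_score(truth, calls)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1-score is also 2/3.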

Keywords: Copy number variants; Germline mutation; Next-generation sequencing; Single nucleotide variations; Somatic mutation; Support vector machine.

Publication types

  • Letter

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology / methods
  • DNA Copy Number Variations
  • Gene Expression Regulation, Neoplastic
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Machine Learning*
  • Mutation / genetics*
  • Neoplasms / genetics*
  • Neoplasms / metabolism

Grants and funding

This study was supported by the CAS Pioneer Hundred Talents Program and the National Natural Science Foundation of China (32070683) to Y.P.C.