Somatic and Germline Variant Calling from Next-Generation Sequencing Data

Adv Exp Med Biol. 2022:1361:37-54. doi: 10.1007/978-3-030-91836-1_3.

Abstract

Re-sequencing of the human genome by next-generation sequencing (NGS) has been widely applied to discover pathogenic genetic variants and/or causative genes underlying various types of diseases, including cancers. Advances in NGS have made it possible to sequence a patient's entire genome and identify disease-associated variants within a reasonable timeframe and at a reasonable cost. At its core, variant identification relies on accurate variant calling and annotation. Numerous algorithms have been developed to elucidate the repertoire of somatic and germline variants. Each algorithm has distinct strengths, weaknesses, and limitations that stem from the statistical modeling approach it adopts and the read information it uses. Accurate variant calling remains challenging due to the presence of sequencing artifacts and read misalignments. Together, these factors can lead to discordant variant calling results and even misinterpretation of the findings. For somatic variant detection, multiple factors including chromosomal abnormalities, tumor heterogeneity, tumor-normal cross-contamination, unbalanced tumor/normal sample coverage, and variants with low allele frequencies add further layers of complexity to accurate variant identification. Given these discordances and difficulties, ensemble approaches have emerged that harmonize information from different algorithms to improve variant calling performance. In this chapter, we first introduce the general scheme of variant calling algorithms and the potential challenges at distinct stages. We next review existing workflows for variant calling and annotation, and finally explore the strategies deployed by different callers as well as their strengths and caveats. Overall, NGS-based variant identification, when performed with careful consideration, allows reliable detection of pathogenic variants and selection of candidate variants for precision medicine.

Keywords: Contamination; Ensemble variant calling; Germline variant; Low-frequency variants; Machine learning; Next-generation sequencing; Single-cell sequencing; Somatic variant; Third-generation sequencing; Tumor-only variant calling; Variant annotation; Variant calling; Variant prioritization.

MeSH terms

  • Algorithms
  • Genome, Human*
  • Germ Cells
  • High-Throughput Nucleotide Sequencing* / methods
  • Humans
  • Models, Statistical
  • Software