NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

Genome Biol. 2021 Sep 6;22(1):261. doi: 10.1186/s13059-021-02472-2.

Abstract

Long-read sequencing enables variant detection in genomic regions that are considered difficult-to-map by short-read sequencing. To fully exploit the benefits of longer reads, we present NanoCaller, a deep learning method that detects SNPs using long-range haplotype information, then phases long reads with the called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome that could not previously be detected reliably. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.
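The phasing step mentioned in the abstract, in which long reads are assigned to haplotypes using already-called SNPs, can be illustrated with a toy sketch. This is not NanoCaller's implementation. The function name `phase_reads`, the input layout, and the simple majority-vote rule are all assumptions made for illustration, and the example reads and SNPs are made-up data.

```python
from collections import Counter

def phase_reads(reads, het_snps):
    """Toy read phasing by majority vote (illustrative assumption only).

    reads:    dict of read_id -> {position: observed base}
    het_snps: dict of position -> (haplotype-1 allele, haplotype-2 allele),
              assumed to be heterozygous SNPs that are already phased.
    Returns a dict of read_id -> 1, 2, or None (tie or no informative SNP).
    """
    assignments = {}
    for read_id, bases in reads.items():
        votes = Counter()
        for pos, (hap1_allele, hap2_allele) in het_snps.items():
            base = bases.get(pos)  # None if the read does not cover this SNP
            if base == hap1_allele:
                votes[1] += 1
            elif base == hap2_allele:
                votes[2] += 1
        if votes[1] > votes[2]:
            assignments[read_id] = 1
        elif votes[2] > votes[1]:
            assignments[read_id] = 2
        else:
            assignments[read_id] = None
    return assignments

# Hypothetical example data: three phased heterozygous SNPs and four reads.
EXAMPLE_HET_SNPS = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}
EXAMPLE_READS = {
    "r1": {100: "A", 250: "C"},  # matches haplotype 1 at both sites
    "r2": {250: "T", 400: "A"},  # matches haplotype 2 at both sites
    "r3": {100: "A", 250: "T"},  # one vote each, left unassigned
    "r4": {500: "C"},            # covers no heterozygous SNP
}

assignments = phase_reads(EXAMPLE_READS, EXAMPLE_HET_SNPS)
print(assignments)
```

Once reads are split by haplotype in this way, indel candidates can be examined separately within each read group, which is the motivation for phasing before indel calling.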

Keywords: Deep learning; Difficult-to-map regions; Long-range haplotype; Variant calling.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Alleles
  • Base Sequence
  • Benchmarking
  • Chromosome Mapping
  • Genome, Human
  • Haplotypes / genetics*
  • High-Throughput Nucleotide Sequencing*
  • Humans
  • INDEL Mutation / genetics*
  • Major Histocompatibility Complex / genetics
  • Nanoparticles / chemistry*
  • Nanopore Sequencing
  • Neural Networks, Computer*
  • Polymorphism, Single Nucleotide / genetics*