hg19KIndel: ethnicity normalized human reference genome

BMC Genomics. 2019 Jun 6;20(1):459. doi: 10.1186/s12864-019-5854-3.

Abstract

Background: The most widely used human reference genome assembly, hg19, harbors minor alleles at 2.18 million positions, as revealed by the 1000 Genomes Project Phase 3 dataset. Although this is less than 3% of the 89 million variants reported, it has been shown that these minor alleles can result in 30% false positives in individual genomes, thus misleading and burdening downstream interpretation. More alarming still, a significant percentage of variants that are homozygous recessive for these minor alleles, with potential disease implications, are masked from reporting.

Results: We have demonstrated that the false positives (FP) and false negatives (FN) can be corrected simply by replacing the nucleotides at the minor allele positions in hg19 with the corresponding major alleles. Here, we have replaced 2.18 million minor alleles in hg19, comprising single nucleotide polymorphisms (SNPs), insertions and deletions (INDELs), and multiple nucleotide polymorphisms (MNPs), with the corresponding major alleles to create an ethnically normalized reference genome called hg19KIndel. In doing so, hg19KIndel both corrects sequencing errors acknowledged to be present in hg19 and improves read alignment near the minor allele positions in hg19.
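
The replacement step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name apply_major_alleles, the tuple layout (0-based position, allele currently in the reference, major allele), and the toy sequence are all assumptions made for illustration. Indels are assumed to be written VCF-style with a leading anchor base, and variants are assumed not to overlap.

```python
def apply_major_alleles(reference, variants):
    """Return a copy of `reference` with each listed allele replaced.

    `variants` is an iterable of (pos, ref_allele, major_allele) tuples,
    where `pos` is 0-based and `ref_allele` is the (minor) allele that the
    reference currently carries at that position. This layout is a
    hypothetical convention for this sketch, not a format from the paper.
    """
    seq = reference
    # Apply edits from right to left so that insertions and deletions
    # do not shift the coordinates of variants that are still pending.
    for pos, ref_allele, major_allele in sorted(
        variants, key=lambda v: v[0], reverse=True
    ):
        end = pos + len(ref_allele)
        # Guard against coordinate or allele mismatches before editing.
        if seq[pos:end] != ref_allele:
            raise ValueError(
                f"reference has {seq[pos:end]!r} at {pos}, expected {ref_allele!r}"
            )
        seq = seq[:pos] + major_allele + seq[end:]
    return seq


# Toy example: one SNP, one insertion, one deletion.
toy_reference = "ACGTACGTAC"
toy_variants = [
    (1, "C", "T"),    # SNP
    (4, "A", "AGG"),  # insertion after the anchor base A
    (7, "TA", "T"),   # deletion of the base after the anchor base T
]
print(apply_major_alleles(toy_reference, toy_variants))  # ATGTAGGCGTC
```

Rebuilding the string on every edit is acceptable for a toy sequence; for whole chromosomes, collecting the unchanged segments and joining them once would be the more practical choice.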

Conclusion: We have created and made available a new version of the human reference genome, called hg19KIndel. Variant calling using hg19KIndel has been shown to significantly reduce false positive calls, which in turn reduces the burden of downstream analysis and validation. It also reduces false negative calls: variants that were being missed because of the minor alleles present in hg19 are now called when using hg19KIndel. hg19KIndel also yields a better mapping percentage than the currently available human reference genome. The hg19KIndel reference genome and its auxiliary datasets are available at https://doi.org/10.5281/zenodo.2638113.

Keywords: Disease predisposition; Human reference genome; Major and minor alleles; Population study; Variant calling.

MeSH terms

  • Alleles
  • Databases, Nucleic Acid
  • Ethnicity / genetics*
  • Genetic Variation*
  • Genome, Human*
  • Humans
  • INDEL Mutation
  • Polymorphism, Single Nucleotide
  • Reference Standards
  • Sequence Analysis, DNA