Increasing calling accuracy, coverage, and read-depth in sequence data by the use of haplotype blocks

PLoS Genet. 2021 Dec 23;17(12):e1009944. doi: 10.1371/journal.pgen.1009944. eCollection 2021 Dec.

Abstract

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics: when resources are limited, geneticists and breeders must balance data quality against the number of genotyped lines across a variety of existing genotyping technologies. In this work, we propose a new imputation pipeline ("HBimpute") that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is to use haplotype blocks from the software HaploBlocker to identify locally similar lines and then to use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputation error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par with or even slightly better than results obtained with high-density array data (600k). For genomic prediction in particular, the sequence data yield slightly higher prediction accuracies than the 600k array, suggesting slightly higher data quality. This occurred specifically when reducing the data panel to the set of overlapping markers between sequence and array, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.
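
The read-pooling idea described in the abstract can be illustrated with a toy sketch. This is not the HBimpute implementation or the HaploBlocker API: the function name `pooled_calls`, the data layout, the block representation, and the majority-vote calling rule are all assumptions made for illustration, and the lines are treated as fully homozygous (as expected for doubled haploids). For each line and marker, the reads of every line sharing a haplotype block at that marker are summed before a call is made, which both raises the effective read-depth and fills positions where the line itself has no reads.

```python
def pooled_calls(read_counts, blocks):
    """Call alleles per line and marker from reads pooled across locally similar lines.

    read_counts: {line: [(ref_reads, alt_reads), ...]}, one tuple per marker.
    blocks: list of (start, end, members); a hypothetical haplotype block spanning
            markers start..end-1 that is carried by the lines in `members`.
    Returns (calls, depth): calls are 0 (ref), 1 (alt) or None (no call).
    """
    n_markers = len(next(iter(read_counts.values())))
    calls, depth = {}, {}
    for line in read_counts:
        line_calls, line_depth = [], []
        for m in range(n_markers):
            # A line always contributes its own reads.
            donors = {line}
            # Add every line that shares a block with this line at marker m.
            for start, end, members in blocks:
                if start <= m < end and line in members:
                    donors |= set(members)
            ref = sum(read_counts[d][m][0] for d in donors)
            alt = sum(read_counts[d][m][1] for d in donors)
            line_depth.append(ref + alt)
            if ref == alt:
                # No reads at all, or a tie: leave the position uncalled.
                line_calls.append(None)
            else:
                line_calls.append(0 if ref > alt else 1)
        calls[line], depth[line] = line_calls, line_depth
    return calls, depth


# Toy data: three lines, three markers, very low read-depth.
reads = {
    "A": [(1, 0), (0, 0), (0, 1)],
    "B": [(0, 0), (2, 0), (0, 0)],
    "C": [(0, 1), (0, 0), (1, 0)],
}
# Lines A and B are assumed to share one block covering markers 0 and 1.
example_blocks = [(0, 2, {"A", "B"})]

calls, depth = pooled_calls(reads, example_blocks)
print(calls)
print(depth)
```

In this toy example, line A has no reads of its own at the second marker but receives a call from the reads of line B, while line C, which is in no block, keeps its gaps. A real pipeline would additionally need to handle sequencing errors, conflicting blocks, and residual heterozygosity, none of which is modelled here.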

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • DNA Copy Number Variations / genetics
  • Genome / genetics
  • Genome-Wide Association Study / standards*
  • Genomics / methods
  • Genotype
  • Genotyping Techniques*
  • Haplotypes / genetics*
  • Polymorphism, Single Nucleotide / genetics
  • Software*
  • Whole Genome Sequencing
  • Zea mays / genetics

Grants and funding

TP, EGGS, CCS, and HS received financial support from the German Federal Ministry of Education and Research (BMBF, https://www.bmbf.de/) via the project MAZE (“Accessing the genomic and functional diversity of maize to improve quantitative traits”; Funding ID: 031B0882). We acknowledge support by the Open Access Publication Funds of the Göttingen University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.