A pan-genome data structure induced by pooled sequencing facilitates variant mining in heterogeneous germplasm

Mol Breed. 2022 Jun 25;42(7):36. doi: 10.1007/s11032-022-01308-6. eCollection 2022 Jul.

Abstract

Valuable genetic variation lies unused in gene banks due to the difficulty of exploiting heterogeneous germplasm accessions. Advances in molecular breeding, including transgenics and genome editing, present the opportunity to exploit hidden sequence variation directly. Here we describe the pan-genome data structure induced by whole-genome sequencing of pooled individuals from wild populations of Patellifolia spp., a source of disease resistance genes for the related crop species sugar beet (Beta vulgaris). We represent the pan-genome as a map of reads from pooled sequencing of a heterogeneous population sample to a reference genome, plus a BLAST database of the mapped reads. We show that this basic data structure can be queried by reference genome position or by homology to identify sequence variants present in the wild relative at genes of agronomic interest in the crop, a process known as allele or variant mining. Further, we demonstrate the possibility of cataloging variants in all Patellifolia genomic regions that have corresponding single-copy orthologous regions in sugar beet. The data structure, termed a "pooled read archive," can be produced, altered, and queried using standard tools to facilitate discovery of agronomically important sequence variation.
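The two query modes described above (by reference position and by homology) can be illustrated with a toy in-memory sketch. This is not the authors' implementation: the names `MappedRead`, `PooledReadArchive`, `query_region`, `alleles_at`, and `query_homology` are hypothetical, the reads are invented example data, alignments are assumed to be gap-free, and a shared k-mer lookup stands in for a real BLAST search.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedRead:
    contig: str  # reference sequence the read maps to
    start: int   # 0-based leftmost reference coordinate
    seq: str     # read bases (assumption: gap-free alignment)

    @property
    def end(self) -> int:
        # Exclusive end coordinate on the reference.
        return self.start + len(self.seq)


class PooledReadArchive:
    """Toy stand-in for a read map plus a searchable read database."""

    def __init__(self, reads):
        self.reads = list(reads)

    def query_region(self, contig, start, end):
        # Position query: reads overlapping the half-open interval [start, end).
        return [r for r in self.reads
                if r.contig == contig and r.start < end and r.end > start]

    def alleles_at(self, contig, pos):
        # Tally the base each overlapping read carries at one reference position.
        counts = Counter()
        for r in self.query_region(contig, pos, pos + 1):
            counts[r.seq[pos - r.start]] += 1
        return counts

    def query_homology(self, query, k=8):
        # Homology query, crudely approximated: reads sharing any exact k-mer
        # with the query. A real search would use BLAST, not this shortcut.
        kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
        return [r for r in self.reads
                if any(r.seq[i:i + k] in kmers
                       for i in range(len(r.seq) - k + 1))]


# Invented pooled reads; the second carries a G where the others carry A
# at reference position 8 of chr1.
archive = PooledReadArchive([
    MappedRead("chr1", 0, "ACGTACGTAC"),
    MappedRead("chr1", 4, "ACGTGCGTAC"),
    MappedRead("chr1", 6, "GTACGTACGT"),
    MappedRead("chr2", 0, "TTTTGGGGCC"),
])

print(archive.alleles_at("chr1", 8))
print([r.contig for r in archive.query_homology("TTTTGGGG")])
```

Because the reads come from a pooled, heterogeneous sample, the per-position tally reflects allele counts across the pool rather than the genotype of any single individual.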

Supplementary information: The online version contains supplementary material available at 10.1007/s11032-022-01308-6.

Keywords: Bioinformatics; Crop wild relatives; Domestication; Genome editing; Haplotype; Phasing; Sequence variant.