GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species

Genome Biol. 2023 Apr 17;24(1):76. doi: 10.1186/s13059-023-02906-z.

Abstract

Whole-genome sequencing projects of millions of subjects produce enormous volumes of genotype data, imposing a heavy burden on memory and computation time. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parallel framework. We demonstrate that GBC is up to 1000 times faster than state-of-the-art methods at accessing and managing compressed large-scale genotypes while maintaining a competitive compression ratio. We also show that conventional analyses are substantially sped up when built on GBC to access the genotypes of a large population. GBC's data structure and algorithms are valuable for accelerating large-scale genomic research.
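
The general idea of addressable byte-encoded blocks can be illustrated with a minimal sketch. This is not GBC's actual file format or algorithm: the one-byte code table, the fixed block size, the use of zlib, and the function names (build_blocks, read_variant) are all assumptions made for illustration. Each genotype is stored as one byte, variants are grouped into independently compressed blocks, and an offset index lets a reader decompress only the block that holds the requested variant.

```python
import zlib

# Hypothetical one-byte codes for biallelic diploid genotypes
# (an illustrative table, not the encoding used by GBC).
GENOTYPE_CODES = {"0|0": 0, "0|1": 1, "1|0": 2, "1|1": 3, ".|.": 4}
CODE_TO_GENOTYPE = {code: gt for gt, code in GENOTYPE_CODES.items()}


def encode_variant(genotypes):
    """Encode one variant's genotypes as one byte per sample."""
    return bytes(GENOTYPE_CODES[gt] for gt in genotypes)


def build_blocks(variants, block_size):
    """Compress variants in independent blocks.

    Returns the concatenated compressed payload and an index holding
    the (offset, length) of every block inside that payload.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(variants), block_size):
        chunk = variants[start:start + block_size]
        raw = b"".join(encode_variant(v) for v in chunk)
        compressed = zlib.compress(raw)
        index.append((len(payload), len(compressed)))
        payload += compressed
    return bytes(payload), index


def read_variant(payload, index, variant_idx, block_size, n_samples):
    """Fetch one variant by decompressing only the block that contains it."""
    offset, length = index[variant_idx // block_size]
    raw = zlib.decompress(payload[offset:offset + length])
    row = (variant_idx % block_size) * n_samples
    return [CODE_TO_GENOTYPE[b] for b in raw[row:row + n_samples]]


if __name__ == "__main__":
    # Five toy variants across three samples, two variants per block.
    variants = [
        ["0|0", "0|1", "1|1"],
        ["0|0", "0|0", "0|0"],
        ["1|0", ".|.", "0|1"],
        ["1|1", "1|1", "0|0"],
        ["0|1", "0|0", "1|0"],
    ]
    payload, index = build_blocks(variants, block_size=2)
    print(read_variant(payload, index, 2, block_size=2, n_samples=3))
```

Because each block is compressed on its own, blocks could also be built or read in parallel, which is the property the abstract's "parallel framework" depends on; the sketch keeps everything sequential for brevity.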

Keywords: Byte-encoding genotypes; Cloud computation; Genotype compression; Genotype management; Highly addressable genotype blocks; Large-scale genotypes; Parallelization algorithm.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Data Compression* / methods
  • Genomics / methods
  • Genotype
  • Humans
  • Software*