Computation of rank and select functions on hierarchical binary string and its application to genome mapping problems for short-read DNA sequences

J Comput Biol. 2009 Nov;16(11):1601-13. doi: 10.1089/cmb.2008.0146.

Abstract

Abstract We have developed efficient in-practice algorithms for computing rank and select functions on a binary string, based on a novel data structure, a hierarchical binary string with hierarchical accumulatives. It efficiently stores decomposed information on partial summations over various scales of subregions of a given binary string, so that the required space overhead ratio is only about 3.5% irrespective of the string length. Values of rank and select functions are computed hierarchically in [(log(2)n)/8] iterations, where n is the string length. For example, for an unbiased random binary string of 64 G bits, each value of these functions can be computed in about a microsecond, on average, on a single 3.0-GHz CPU using 8+ GB of memory. We also present their applications to genome mapping problems for large-scale short-read DNA sequence data, especially produced by ultra-high-throughput new-generation DNA sequencers. The algorithms are applied to the binarization of the Burrows-Wheeler transform of the human genome DNA sequence. For the sake of high-speed performance, we adopted a somewhat stringent mapping condition that allows at most a single-base mismatch (either a substitution, insertion, or deletion of a single base) per query sequence. An experimentally implemented program mapped several thousands of sequences per second on a single 3.0-GHz CPU, several times faster than ELAND, a widely used mapping program with the Illumina-Solexa 1G analyser.

MeSH terms

  • Algorithms
  • Base Sequence
  • Chromosome Mapping / methods*
  • Computational Biology / methods*
  • Genome, Human / genetics*
  • Humans
  • Oligonucleotide Array Sequence Analysis
  • Time Factors