Memory efficient assembly of human genome

J Bioinform Comput Biol. 2015 Apr;13(2):1550008. doi: 10.1142/S0219720015500080. Epub 2014 Dec 17.

Abstract

The ability to detect the genetic variations between two individuals is an essential component of genetic studies. In these studies, obtaining the genome sequence of both individuals is the first step toward solving the variation detection problem. The emergence of high-throughput sequencing (HTS) technology has made DNA sequencing practical, and HTS is widely used by diagnosticians to increase their knowledge about the causal factors in genetics-related diseases. As HTS advances, more data are generated every day than scientists can process. Genome assembly is one of the existing methods to tackle the variation detection problem. The de Bruijn graph formulation of the assembly problem is widely used in the field. Furthermore, it is the only method which can assemble any genome in linear time. However, it requires an enormous amount of memory to assemble any mammalian-size genome. The high demand for sequencing more individuals and the urge to assemble their genomes are the driving forces behind the need for a memory-efficient assembler. In this work, we propose a novel method which builds the de Bruijn graph while consuming less memory. Our proposed method can reduce memory usage by 37% compared to existing methods. In addition, we used a real data set (chromosome 17 of the A/J strain) to illustrate the performance of our method.
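
For readers unfamiliar with the de Bruijn graph formulation mentioned above, the following is a minimal sketch of the generic textbook construction, not the memory-efficient method proposed in this work. The function name `build_de_bruijn`, the toy reads, and the choice of k = 4 are assumptions made purely for illustration. Each k-mer in a read contributes one edge from its (k-1)-length prefix to its (k-1)-length suffix.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a simple de Bruijn graph (hypothetical helper, for illustration).

    Nodes are (k-1)-mers; every k-mer found in a read adds an edge from
    its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Returns a dict mapping each node to a sorted list of its successors.
    """
    edges = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in edges.items()}


# Toy reads (assumed data, not taken from the study).
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, 4)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

This naive version keeps every distinct (k-1)-mer as a string key in memory, which hints at why the approach becomes memory-hungry at mammalian genome scale.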

Keywords: de Bruijn graph; genome assembly; high-throughput sequencing; local assembly.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology
  • Computer Simulation
  • Genome, Human*
  • High-Throughput Nucleotide Sequencing / statistics & numerical data*
  • Humans
  • Mice
  • Mice, Inbred A / genetics
  • Mutation
  • Polymorphism, Single Nucleotide
  • Sequence Alignment / statistics & numerical data
  • Sequence Analysis, DNA / statistics & numerical data
  • Software