Sequence assembly using next generation sequencing data--challenges and solutions

Francis Y L Chin; Henry C M Leung; S M Yiu

doi:10.1007/s11427-014-4752-9

Sci China Life Sci. 2014 Nov;57(11):1140-8. doi: 10.1007/s11427-014-4752-9. Epub 2014 Oct 17.

Authors

Francis Y L Chin¹, Henry C M Leung, S M Yiu

Affiliation

¹ Department of Computer Science, The University of Hong Kong, Hong Kong, China, chin@cs.hku.hk.

PMID: 25326069

Abstract

Sequence assembly is an important step in bioinformatics studies. With next generation sequencing (NGS) technology, DNA fragments (reads) can be randomly sampled at high throughput from DNA or RNA molecules. However, because the positions from which the reads are sampled are unknown, an assembly process is required to combine overlapping reads and reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing, NGS offers higher throughput, but its reads are shorter and have a higher error rate, which introduces several problems in assembly. Moreover, paired-end reads, which contain more information than single-end reads, can be sampled. Existing assemblers cannot fully utilize this information and fail to assemble longer contigs. In this article, we revisit the major problems of assembling NGS reads from genomic, transcriptomic, metagenomic and metatranscriptomic data. We also describe our IDBA package for solving these problems. The IDBA package adopts several novel ideas in assembly, including the use of multiple k values, local assembly and progressive depth removal. Compared with existing assemblers, IDBA performs better on many simulated and real sequencing datasets.
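
The motivation for using multiple k values can be illustrated on a toy de Bruijn graph, the graph representation on which IDBA is based. The following is a minimal sketch under stated assumptions, not IDBA's implementation: the function names (`kmers`, `build_de_bruijn`, `branching_nodes`, `missing_kmers`), the 8-base toy genome and the two error-free reads are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return all length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Count nodes with more than one successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)

def missing_kmers(genome, reads, k):
    """Count genome k-mers that no read contains (gaps in the graph)."""
    sampled = {km for read in reads for km in kmers(read, k)}
    return len(set(kmers(genome, k)) - sampled)

# Hypothetical toy data: the genome contains the repeat "ACG" twice.
GENOME = "ACGTACGA"
READS = ["ACGTAC", "GTACGA"]

for k in (3, 5, 6):
    graph = build_de_bruijn(READS, k)
    print(k, branching_nodes(graph), missing_kmers(GENOME, READS, k))
```

In this toy case, k = 3 produces one branching node because the repeat "ACG" is as long as the k-mer, k = 5 removes the branch while every genome k-mer is still sampled, and k = 6 loses one genome k-mer because no read spans it. A small k therefore yields a connected but ambiguous graph, and a large k yields an unambiguous but gapped one, which is the trade-off that combining several k values is meant to address.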

Publication types

Research Support, Non-U.S. Gov't
Review

MeSH terms

Algorithms
Computational Biology / methods*
Contig Mapping / methods
DNA / chemistry*
Escherichia coli / genetics
False Positive Reactions
Genome
Genome, Bacterial
Humans
Lactobacillus plantarum / genetics
Metagenomics
RNA / chemistry*
Sequence Analysis, DNA / methods*
Software
Transcription, Genetic
Transcriptome

Substances

RNA
DNA