EPGA: de novo assembly using the distributions of reads and insert size

Junwei Luo; Jianxin Wang; Zhen Zhang; Fang-Xiang Wu; Min Li; Yi Pan

doi:10.1093/bioinformatics/btu762

Bioinformatics. 2015 Mar 15;31(6):825-33. doi: 10.1093/bioinformatics/btu762. Epub 2014 Nov 17.

Authors

Junwei Luo¹, Jianxin Wang², Zhen Zhang², Fang-Xiang Wu², Min Li², Yi Pan²

Affiliations

¹ School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.
² School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.

PMID: 25406329
DOI: 10.1093/bioinformatics/btu762

Abstract

Motivation: In genome assembly, the primary issue is how to determine the upstream and downstream sequence regions of sequence seeds in order to construct long contigs or scaffolds. When a sequence seed is extended, repetitive regions in the genome give rise to multiple feasible extension candidates, which increases the difficulty of genome assembly. The universally accepted solution is to choose one candidate based on read overlaps and paired-end (mate-pair) reads. However, this solution has difficulty with some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions, and uneven sequencing depth leaves some sequence regions with too few or too many reads. All of these problems prevent existing assemblers from producing satisfactory assembly results.

Results: In this article, we develop an algorithm, called extract paths for genome assembly (EPGA), which extracts paths from the de Bruijn graph for genome assembly. EPGA uses a new score function to evaluate extension candidates based on the distributions of reads and insert size. The distribution of reads can solve problems caused by sequencing errors and short repetitive regions. By assessing the variation of the distribution of insert size, EPGA can solve problems introduced by some complex repetitive regions. To handle uneven sequencing depth, EPGA uses relative mapping to evaluate extension candidates. On real datasets, we compare the performance of EPGA with that of other popular assemblers. The experimental results demonstrate that EPGA can effectively obtain longer and more accurate contigs and scaffolds.
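
As an illustration of the setting described above, the following minimal Python sketch builds a de Bruijn graph from a handful of toy reads and lists the extension candidates at the end of a sequence seed. It is not EPGA's implementation: the function names `build_de_bruijn` and `extension_candidates`, the toy reads and the value of k are all assumptions made for this example, and candidates are ranked by plain k-mer count, a simple stand-in for the score function based on read and insert-size distributions that the abstract describes.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to its successor nodes, counting supporting k-mers."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def extension_candidates(graph, seed, k):
    """Return (next_base, kmer_count) pairs for the node at the 3' end of the seed.

    Ranking by raw k-mer count is a placeholder (an assumption for this
    sketch), not the score function used by EPGA.
    """
    node = seed[-(k - 1):]
    successors = graph.get(node, {})
    pairs = [(nxt[-1], count) for nxt, count in successors.items()]
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


# Toy reads (hypothetical data): two of them diverge after "ACGTA",
# mimicking a branch caused by a repeat or a sequencing error.
reads = ["ACGTAC", "CGTACG", "GTACGA", "ACGTAT", "CGTATT"]
k = 4
graph = build_de_bruijn(reads, k)
candidates = extension_candidates(graph, "ACGTA", k)
print(candidates)
```

With these toy reads, the seed "ACGTA" ends at a node with two successors, so two extension candidates are reported. Choosing between such candidates is the step where a count alone can be misleading under uneven sequencing depth or in repetitive regions.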

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Bacteria / genetics
Genome, Bacterial*
Repetitive Sequences, Nucleic Acid / genetics*
Sequence Analysis, DNA / methods*