Error tolerant indexing and alignment of short reads with covering template families

Eldar Giladi; John Healy; Gene Myers; Chris Hart; Philipp Kapranov; Doron Lipson; Steve Roels; Edward Thayer; Stan Letovsky

doi:10.1089/cmb.2010.0005

J Comput Biol. 2010 Oct;17(10):1397-1411. doi: 10.1089/cmb.2010.0005.

Authors

Eldar Giladi¹, John Healy, Gene Myers, Chris Hart, Philipp Kapranov, Doron Lipson, Steve Roels, Edward Thayer, Stan Letovsky

Affiliation

¹ Helicos BioSciences Corporation, Cambridge, Massachusetts 02139, USA. egiladi@helicosbio.com

PMID: 20937014

Abstract

The rapid adoption of high-throughput next-generation sequence data in biological research is presenting a major challenge for sequence alignment tools: specifically, the efficient alignment of vast amounts of short reads to large references in the presence of differences arising from sequencing errors and biological sequence variations. To address this challenge, we developed a short read aligner for high-throughput sequencer data that is tolerant of errors or mutations of all types, namely substitutions, deletions, and insertions. The aligner uses a multi-stage approach in which template-based indexing identifies candidate regions, which are then aligned with dynamic programming. A template is a pair of gapped seeds, one applied to the read and one applied to the reference. In this article, we focus on the development of template families that yield error-tolerant indexing up to a given error budget. A general algorithm for finding those families is presented, and a recursive construction that creates families with higher error tolerance from ones with lower error tolerance is developed.
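The template idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The mask notation ('1' = sampled position, '0' = skipped position), the function names, and the toy sequences are all assumptions made for illustration, and the two-template family shown is not one of the families constructed in the paper. The sketch indexes the reference with one gapped seed, looks up the read with the other, and reports candidate start positions that a later dynamic-programming stage would verify.

```python
from collections import defaultdict

def apply_mask(seq, start, mask):
    """Sample seq at start+i for every i where mask[i] == '1'.
    Assumes the caller keeps the window inside seq."""
    return "".join(seq[start + i] for i, m in enumerate(mask) if m == "1")

def build_index(reference, ref_mask):
    """Map each masked reference key to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(ref_mask) + 1):
        index[apply_mask(reference, pos, ref_mask)].append(pos)
    return index

def candidates(read, reference, template):
    """Candidate reference start positions for one template.
    A template is a hypothetical (read_mask, ref_mask) pair; both masks
    must sample the same number of positions so the keys are comparable."""
    read_mask, ref_mask = template
    assert read_mask.count("1") == ref_mask.count("1")
    index = build_index(reference, ref_mask)
    hits = set()
    for offset in range(len(read) - len(read_mask) + 1):
        key = apply_mask(read, offset, read_mask)
        for ref_pos in index.get(key, ()):
            hits.add(ref_pos - offset)  # implied start of the read on the reference
    return sorted(hits)

def candidates_family(read, reference, family):
    """Union of the candidates produced by every template in a family."""
    hits = set()
    for template in family:
        hits.update(candidates(read, reference, template))
    return sorted(hits)

# Toy data (assumed): the read is reference[2:12] with the base at
# reference position 6 deleted.
reference = "GATTACAGGCTTCAGT"
read = "TTACGGCTT"

ungapped = ("111111", "111111")       # same contiguous seed on both sides
skip_one_ref = ("111111", "1110111")  # reference seed skips one base

print(candidates(read, reference, ungapped))                         # []
print(candidates(read, reference, skip_one_ref))                     # [2]
print(candidates_family(read, reference, [ungapped, skip_one_ref]))  # [2]
```

In this toy case the ungapped template finds nothing, because every contiguous 6-mer of the read spans the deletion, while the template whose reference seed skips one base recovers the true start position 2. A covering family, in the abstract's sense, would need enough such templates that every placement of errors within the error budget is caught by at least one of them.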

MeSH terms

Algorithms
Base Sequence
Molecular Sequence Data
Sequence Alignment*
Sequence Analysis, DNA*
Software
Templates, Genetic*