Optimal seed solver: optimizing seed selection in read mapping

Hongyi Xin; Sunny Nahar; Richard Zhu; John Emmons; Gennady Pekhimenko; Carl Kingsford; Can Alkan; Onur Mutlu

doi:10.1093/bioinformatics/btv670

Bioinformatics. 2016 Jun 1;32(11):1632-42. doi: 10.1093/bioinformatics/btv670. Epub 2015 Nov 14.

Authors

Hongyi Xin¹, Sunny Nahar¹, Richard Zhu¹, John Emmons², Gennady Pekhimenko¹, Carl Kingsford³, Can Alkan⁴, Onur Mutlu⁵

Affiliations

¹ Computer Science Department.
² Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130, USA.
³ Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
⁴ Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey.
⁵ Computer Science Department, Department of Electrical and Computer Engineering.

Abstract

Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines its sensitivity, while the total frequency of all selected seeds determines its speed. Modern seed-and-extend mappers usually select seeds with either an equal, fixed-length scheme or an inflexible placement scheme, both of which limit the mapper's ability to select less frequent seeds and so speed up the mapping process. It is therefore crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, and can derive less frequent seeds.

Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-base-pair read in [Formula: see text] operations on average and in [Formula: see text] operations in the worst case, while generating a maximum of [Formula: see text] seed frequency database lookups. We compare OSS against four state-of-the-art seed selection schemes and observe that OSS provides a 3-fold reduction in average seed frequency over the best previous seed selection optimizations.
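To make the optimization problem concrete, the following is a minimal sketch of a naive dynamic program that splits a read into x non-overlapping seeds with the smallest total frequency. It is not the OSS algorithm itself and includes none of its optimizations. The function names (`count_occurrences`, `min_frequency_seeds`), the `min_len` parameter, and the use of a plain string scan in place of a seed frequency database are all assumptions made for illustration. The sketch forces the seeds to cover the whole read, which loses nothing because extending a seed can never increase its frequency.

```python
def count_occurrences(reference: str, seed: str) -> int:
    """Count (possibly overlapping) occurrences of seed in reference.

    Hypothetical stand-in for a seed frequency database lookup.
    """
    count = 0
    start = reference.find(seed)
    while start != -1:
        count += 1
        start = reference.find(seed, start + 1)
    return count


def min_frequency_seeds(read: str, reference: str, x: int, min_len: int = 1):
    """Split read into x contiguous seeds minimizing the sum of seed frequencies.

    Naive O(x * L^2) dynamic program (illustrative only, not OSS).
    Returns (total_frequency, list_of_seeds).
    """
    L = len(read)
    if x < 1 or min_len < 1 or x * min_len > L:
        raise ValueError("cannot place x seeds of at least min_len in this read")

    INF = float("inf")
    # best[k][j]: minimum total frequency when read[:j] is split into k seeds.
    best = [[INF] * (L + 1) for _ in range(x + 1)]
    # cut[k][j]: start position of the k-th seed in that optimal split.
    cut = [[-1] * (L + 1) for _ in range(x + 1)]
    best[0][0] = 0

    freq_cache = {}

    def freq(i: int, j: int) -> int:
        # Memoize lookups so each substring read[i:j] is counted once.
        if (i, j) not in freq_cache:
            freq_cache[(i, j)] = count_occurrences(reference, read[i:j])
        return freq_cache[(i, j)]

    for k in range(1, x + 1):
        for j in range(k * min_len, L + 1):
            # The k-th seed is read[i:j]; the first k-1 seeds cover read[:i].
            for i in range((k - 1) * min_len, j - min_len + 1):
                if best[k - 1][i] == INF:
                    continue
                total = best[k - 1][i] + freq(i, j)
                if total < best[k][j]:
                    best[k][j] = total
                    cut[k][j] = i

    # Walk the stored cut points backwards to recover the seeds.
    seeds = []
    j = L
    for k in range(x, 0, -1):
        i = cut[k][j]
        seeds.append(read[i:j])
        j = i
    seeds.reverse()
    return best[x][L], seeds
```

As a toy usage example, calling `min_frequency_seeds("ACGTTTGA", "ACGTACGTTTGACCA", 2)` on this made-up reference yields a total seed frequency of 2, whereas the equal-length split `ACGT` | `TTGA` has a total frequency of 3. This illustrates, on a very small scale, why flexible seed lengths and placement can produce less frequent seeds.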

Availability and implementation: We provide an implementation of the Optimal Seed Solver in C++ at: https://github.com/CMU-SAFARI/Optimal-Seed-Solver

Contact: hxin@cmu.edu, calkan@cs.bilkent.edu.tr or onur@cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*