qPMS9: an efficient algorithm for quorum Planted Motif Search

Sci Rep. 2015 Jan 15:5:7813. doi: 10.1038/srep07813.

Abstract

Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (ℓ, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers ℓ and d. It returns all sequences M of length ℓ that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (ℓ, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • DNA / chemistry
  • Internet
  • Proteins / chemistry
  • Software*

Substances

  • Proteins
  • DNA