Clustering sequences by overlap

Int J Data Min Bioinform. 2009;3(3):260-79. doi: 10.1504/ijdmb.2009.026701.

Abstract

A clustering algorithm is introduced that combines the strengths of clustering and motif-finding techniques. Clusters are identified based on unambiguously defined sequence sections, as in motif-finding algorithms. The definition of similarity within clusters allows transitive matches and thereby enables the discovery of remote homologies that cannot be found through motif-finding algorithms. Directed Acyclic Graph (DAG) structures are constructed that link short clusters to longer ones. We compare the clustering results to the corresponding domains in the InterPro database. A second comparison shows that annotations based on our domains are inherently more consistent than those based on InterPro domains.
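The idea of transitive matches can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes that pairwise matches between sequence sections have already been computed by some other means, and it groups sections with a standard union-find structure, so that A and C share a cluster whenever A matches B and B matches C. The function name `transitive_clusters`, the section labels, and the match list are all hypothetical.

```python
from collections import defaultdict


def transitive_clusters(sections, matches):
    """Group sections so that any chain of pairwise matches shares a cluster.

    sections: iterable of hashable section identifiers (hypothetical labels)
    matches:  iterable of (a, b) pairs judged similar by some external test
    """
    # Each section starts as the root of its own cluster.
    parent = {s: s for s in sections}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members under their final root.
    groups = defaultdict(set)
    for s in sections:
        groups[find(s)].add(s)

    # Sort members and clusters for a deterministic result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Illustrative input: A matches B and B matches C, but A and C were never
# compared directly. D matches nothing.
sections = ["A", "B", "C", "D"]
matches = [("A", "B"), ("B", "C")]
clusters = transitive_clusters(sections, matches)
print(clusters)
```

In this toy input, A and C end up in the same cluster even though no direct match between them is given, which is the sense in which transitive similarity can reach relationships that a direct pairwise test would miss.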

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Cluster Analysis*
  • Pattern Recognition, Automated*
  • Sequence Alignment / methods
  • Sequence Analysis, DNA / methods