Clustering sequences by overlap

Dietmar H Dorr; Anne M Denton

doi:10.1504/ijdmb.2009.026701

Int J Data Min Bioinform. 2009;3(3):260-79. doi: 10.1504/ijdmb.2009.026701.

Authors

Dietmar H Dorr¹, Anne M Denton

Affiliation

¹ Department of Computer Science, North Dakota State University, Fargo, ND 58105, USA. dietmar.dorr@ndsu.edu

PMID: 19623770
DOI: 10.1504/ijdmb.2009.026701

Abstract

A clustering algorithm is introduced that combines the strengths of clustering and motif finding techniques. Clusters are identified based on unambiguously defined sequence sections as in motif finding algorithms. The definition of similarity within clusters allows transitive matches and, thereby, enables the discovery of remote homologies that cannot be found through motif-finding algorithms. Directed Acyclic Graph (DAG) structures are constructed that link short clusters to the longer ones. We compare the clustering results to the corresponding domains in the InterPro database. A second comparison shows that annotations based on our domains are inherently more consistent than those based on InterPro domains.
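
The idea of transitive matches can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It is a generic single-linkage grouping of fixed-length sequence sections, written under assumptions chosen here for illustration: sections are compared by Hamming distance, and the names `windows`, `hamming`, `cluster_windows`, `k`, and `max_mismatch` are all hypothetical. It shows how two sections that do not match each other directly can still land in the same cluster through an intermediate section.

```python
from itertools import combinations


def windows(seqs, k):
    """Yield (sequence id, offset, section) for every length-k section."""
    for sid, s in seqs.items():
        for i in range(len(s) - k + 1):
            yield sid, i, s[i:i + k]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_windows(seqs, k, max_mismatch):
    """Group sections into connected components of the 'direct match' graph.

    Assumption for illustration: two sections match directly when their
    Hamming distance is at most max_mismatch. Union-find then makes the
    relation transitive, so A~B and B~C place A and C in one cluster.
    """
    items = list(windows(seqs, k))
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Quadratic all-pairs comparison; fine for a toy example only.
    for i, j in combinations(range(len(items)), 2):
        if hamming(items[i][2], items[j][2]) <= max_mismatch:
            parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())


if __name__ == "__main__":
    toy = {"a": "AAAA", "b": "AAAT", "c": "AATT"}
    for cluster in cluster_windows(toy, k=4, max_mismatch=1):
        print(cluster)
```

In the toy input, "AAAA" and "AATT" differ in two positions and so do not match directly at `max_mismatch=1`, yet both match "AAAT", so all three sections end up in a single cluster. A pairwise-only criterion would miss the link between the first and the third.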

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Cluster Analysis*
Pattern Recognition, Automated*
Sequence Alignment / methods
Sequence Analysis, DNA / methods