Clustering of protein domains for functional and evolutionary studies

Pavle Goldstein; Jurica Zucko; Dusica Vujaklija; Anita Krisko; Daslav Hranueli; Paul F Long; Catherine Etchebest; Bojan Basrak; John Cullum

doi:10.1186/1471-2105-10-335

BMC Bioinformatics. 2009 Oct 15:10:335. doi: 10.1186/1471-2105-10-335.

Authors

Pavle Goldstein¹, Jurica Zucko, Dusica Vujaklija, Anita Krisko, Daslav Hranueli, Paul F Long, Catherine Etchebest, Bojan Basrak, John Cullum

Affiliation

¹ Department of Genetics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany.

Abstract

Background: The number of protein family members identified by DNA sequencing is usually much larger than the number characterized experimentally. This paper describes a method to divide protein families into subtypes purely on sequence criteria. Comparison with experimental data allows an independent test of the quality of the clustering.

Results: An evolutionary split statistic is calculated for each column in a protein multiple sequence alignment; the statistic has a larger value when a column is better described by an evolutionary model that assumes clustering around two or more amino acids rather than a single amino acid. The user selects columns (typically the top-ranked columns) to construct a motif. The motif is used to divide the family into subtypes using a stochastic optimization procedure related to the deterministic annealing EM algorithm (DAEM), which yields a specificity score showing how well each family member is assigned to a subtype. The clustering obtained is not strongly dependent on the number of amino acids chosen for the motif. The robustness of this method was demonstrated using six well-characterized protein families: nucleotidyl cyclase, protein kinase, dehydrogenase, two polyketide synthase domains and small heat shock proteins. Phylogenetic trees did not allow accurate clustering for three of the six families.
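The pipeline described in the Results (score columns, pick a motif, cluster with annealed EM) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the evolutionary split statistic or the specificity score, so the sketch substitutes simple stand-ins. The hypothetical `split_score` is a log-likelihood difference between a two-centre and a one-centre mismatch model with an assumed error rate `eps`; the hypothetical `daem_cluster` is a plain deterministic annealing EM for a categorical mixture with equal subtype weights; and the final posterior probability stands in for the specificity score. The toy alignment is invented and gap-free.

```python
import math
from collections import Counter

import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def split_score(column, eps=0.1):
    """Stand-in split statistic (an assumption, not the paper's formula):
    log-likelihood of a two-centre model minus that of a one-centre model."""
    n = len(column)
    top = Counter(column).most_common(2)
    n1 = top[0][1]
    n2 = top[1][1] if len(top) > 1 else 0
    log_hit = math.log(1 - eps)          # residue equals the single centre
    log_miss = math.log(eps / 19)        # residue equals neither centre
    log_hit2 = math.log(0.5 * (1 - eps) + 0.5 * eps / 19)  # equals one of two centres
    ll_one = n1 * log_hit + (n - n1) * log_miss
    ll_two = (n1 + n2) * log_hit2 + (n - n1 - n2) * log_miss
    return ll_two - ll_one


def rank_columns(seqs):
    """Return column indices ordered from highest to lowest split score."""
    scores = [split_score([s[j] for s in seqs]) for j in range(len(seqs[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def daem_cluster(seqs, motif_cols, k=2, betas=(0.5, 0.75, 1.0),
                 iters=30, pseudo=0.01, seed=0):
    """Annealed EM for a categorical mixture over the motif columns.
    Returns (subtype label, posterior of that label) for every sequence."""
    rng = np.random.default_rng(seed)
    idx = np.array([[AMINO.index(s[j]) for j in motif_cols] for s in seqs])
    onehot = np.eye(len(AMINO))[idx]     # shape (n_seqs, motif_len, 20)
    n = idx.shape[0]
    # Start from nearly uniform responsibilities with a small random perturbation.
    resp = 1.0 / k + rng.uniform(-0.1, 0.1, size=(n, k))
    resp /= resp.sum(axis=1, keepdims=True)
    for beta in betas:                   # inverse temperature rises to 1
        for _ in range(iters):
            # M-step: per-subtype residue frequencies with a small pseudocount.
            counts = np.einsum("ik,ila->kla", resp, onehot) + pseudo
            theta = counts / counts.sum(axis=2, keepdims=True)
            # E-step: posteriors tempered by beta (equal subtype weights assumed).
            loglik = np.einsum("ila,kla->ik", onehot, np.log(theta))
            scaled = beta * loglik
            scaled -= scaled.max(axis=1, keepdims=True)
            resp = np.exp(scaled)
            resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), resp.max(axis=1)


# Invented toy alignment: five sequences of one subtype, five of another.
seqs = ["MKADLS", "MKADLT", "MKADLS", "MRADLS", "MKADLS",
        "MKGEFS", "MKGEFS", "MRGEFT", "MKGEFS", "MKGEFS"]
motif = sorted(rank_columns(seqs)[:3])   # user-chosen: the three top-ranked columns
labels, specificity = daem_cluster(seqs, motif)
```

In this toy case the three columns that differ between the two groups rank highest, and the annealed EM separates the groups. Real alignments would also need gap handling and a principled choice of motif length, which the sketch leaves out.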

Conclusion: The method clustered the families into functional subtypes with an accuracy of 90 to 100%. False assignments usually had a low specificity score.
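An accuracy figure of this kind requires matching arbitrary cluster labels to known functional subtypes. The abstract does not say how the authors did this, so the following is only an illustrative sketch: the hypothetical helper `clustering_accuracy` tries every assignment of cluster labels to subtype names and reports the best agreement. It assumes there are no more clusters than known subtypes and that the number of subtypes is small enough to enumerate.

```python
from itertools import permutations


def clustering_accuracy(predicted, known):
    """Fraction of members whose cluster maps to their known subtype,
    maximized over all one-to-one mappings of cluster labels to subtypes."""
    pred_labels = sorted(set(predicted))
    known_labels = sorted(set(known))
    best = 0.0
    for perm in permutations(known_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(mapping[p] == t for p, t in zip(predicted, known))
        best = max(best, hits / len(known))
    return best
```

For example, `clustering_accuracy([0, 0, 0, 1], ["x", "x", "y", "y"])` evaluates both possible mappings and keeps the one under which three of the four members agree.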

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cluster Analysis*
Computational Biology / methods*
Databases, Protein
Evolution, Molecular
Protein Structure, Tertiary
Proteins / chemistry*
Sequence Alignment
Sequence Analysis, Protein / methods

Substances

Proteins