A sequence alignment-independent method for protein classification

John K Vries; Rajan Munshi; Dror Tobi; Judith Klein-Seetharaman; Panayiotis V Benos; Ivet Bahar

doi:10.2165/00822942-200403020-00008

Appl Bioinformatics. 2004;3(2-3):137-48. doi: 10.2165/00822942-200403020-00008.

Authors

John K Vries¹, Rajan Munshi, Dror Tobi, Judith Klein-Seetharaman, Panayiotis V Benos, Ivet Bahar

Affiliation

¹ Department of Molecular Genetics and Biochemistry, School of Medicine, Center for Computational Biology and Bioinformatics, University of Pittsburgh, 200 Lothrop Street, Pittsburgh, PA 15213, USA. vries@ccbb.pitt.edu

PMID: 15693739

Abstract

Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (20^4) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.
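
The idea of comparing an unknown sequence with per-family 4-gram probes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how probes are selected or how the Bayesian probability is computed, so this sketch assumes a probe is simply the set of 4-grams found in the most family members, and it ranks families by the raw count of shared 4-grams. The function names, the `top_k` parameter, and the toy sequences are all hypothetical. For scale, a 20-letter alphabet gives 20^4 = 160,000 possible 4-grams.

```python
from collections import Counter

N_POSSIBLE_4GRAMS = 20 ** 4  # 160,000 possible 4-grams over 20 amino acids


def four_grams(seq, n=4):
    """Return all overlapping contiguous n-grams of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_probe(family_seqs, top_k=50):
    """Hypothetical probe: the top_k 4-grams present in the most members.

    Assumption: the paper selects 4-grams that "best characterise" a family;
    here that is approximated by how many member sequences contain each one.
    """
    counts = Counter()
    for seq in family_seqs:
        counts.update(set(four_grams(seq)))  # count each 4-gram once per sequence
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram for gram, _ in ranked[:top_k]}


def rank_families(query, probes):
    """Rank families by the number of 4-grams the query shares with each probe.

    Assumption: a shared-4-gram count stands in for the Bayesian probability
    used in the paper, which is not described in the abstract.
    """
    query_grams = set(four_grams(query))
    scores = [(name, len(query_grams & probe)) for name, probe in probes.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data, invented purely for illustration.
families = {
    "famA": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "famB": ["GSSHHHHHHS", "GSSHHHHHHT"],
}
probes = {name: build_probe(seqs) for name, seqs in families.items()}
ranking = rank_families("AAMKTAYIAK", probes)
```

Here `ranking[0]` is the best-matching family, which mirrors the abstract's criterion that a result counts as a true positive when the correct family is first in the rank-ordered list. The 4-grams in `query_grams & probe` for the top family are the shared patterns that the case studies compared with PRINTS motifs.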

Publication types

Evaluation Study
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Amino Acid Motifs
Amino Acid Sequence
Conserved Sequence
Models, Chemical
Models, Molecular
Models, Statistical
Molecular Sequence Data
Pattern Recognition, Automated / methods*
Proteins / analysis
Proteins / chemistry*
Proteins / classification*
Sequence Alignment
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid

Substances

Proteins