A novel complexity measure for comparative analysis of protein sequences from complete genomes

Tannistha Nandi; Debasis Dash; Rohit Ghai; Chandrika B-Rao; K Kannan; Samir K Brahmachari; C Ramakrishnan; Srinivasan Ramachandran

doi:10.1080/07391102.2003.10506882

J Biomol Struct Dyn. 2003 Apr;20(5):657-68. doi: 10.1080/07391102.2003.10506882.

Authors

Tannistha Nandi¹, Debasis Dash, Rohit Ghai, Chandrika B-Rao, K Kannan, Samir K Brahmachari, C Ramakrishnan, Srinivasan Ramachandran

Affiliation

¹ Institute of Genomics and Integrative Biology, Centre for Biochemical Technology, Mall Road, Delhi 110 007, India.

PMID: 12643768
DOI: 10.1080/07391102.2003.10506882

Abstract

Analysis of sequence complexities of proteins is an important step in the characterization and classification of new genomes. A new measure is proposed to compute sequence complexity in protein sequences based on linguistic complexity. The algorithm requires a single parameter, is computationally simple and provides a framework for comparative genomic analysis. Protein sequences were classified into groups of high or low complexity based on a quantitative measure termed F(c), which is proportional to the fraction of low complexity sequence present in the protein. The algorithm was tested on sequences of 196 non-homologous proteins whose crystal structures are available at ≤2.0 Å resolution. Protein sequences of high complexity had 'globular' structures (95% agreement), whereas those of low complexity had non-globular structures (80% agreement). Application of this measure to proteins of unknown structure/function from different genomes revealed that the sequences of high complexity constitute the majority in all genomes (about 90% in Archaea, about 93% in Eubacteria, 89% in Saccharomyces cerevisiae and 90% in Caenorhabditis elegans). Aeropyrum pernix among Archaea and Deinococcus radiodurans among Eubacteria have the lowest fraction of high complexity proteins (75% and 80% respectively). Further, it was observed that a few bacterial pathogens (Mycobacterium tuberculosis, Pseudomonas aeruginosa) have a high fraction of low complexity proteins. The program ScanCom is available from the authors as a PERL script (UNIX system).
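
The abstract does not give the exact definition of F(c), so the following Python sketch only illustrates the general idea of a linguistic-complexity score and a per-protein low-complexity fraction. It is not ScanCom. The function names, the sliding window of 20 residues, the k-mer lengths 1 to 3 and the threshold of 0.6 are all assumptions chosen for illustration; the published method uses a single parameter whose nature is not stated here.

```python
def linguistic_complexity(segment: str, max_k: int = 3, alphabet_size: int = 20) -> float:
    """Distinct k-mers observed divided by the maximum possible, pooled over k = 1..max_k.

    Hypothetical helper: one common form of linguistic complexity, not
    necessarily the form used in the paper.
    """
    observed = 0
    possible = 0
    n = len(segment)
    for k in range(1, max_k + 1):
        if n < k:
            break
        # Distinct substrings of length k found in the segment.
        kmers = {segment[i:i + k] for i in range(n - k + 1)}
        observed += len(kmers)
        # Upper bound: limited by the alphabet and by the number of positions.
        possible += min(alphabet_size ** k, n - k + 1)
    return observed / possible if possible else 0.0


def low_complexity_fraction(sequence: str, window: int = 20, threshold: float = 0.6) -> float:
    """Fraction of residues covered by at least one low-complexity window.

    Assumed stand-in for a quantity proportional to F(c); window and
    threshold are illustrative values, not taken from the paper.
    """
    n = len(sequence)
    if n == 0:
        return 0.0
    window = min(window, n)
    covered = [False] * n
    for start in range(n - window + 1):
        if linguistic_complexity(sequence[start:start + window]) < threshold:
            # Mark every residue in a low-complexity window.
            for i in range(start, start + window):
                covered[i] = True
    return sum(covered) / n


if __name__ == "__main__":
    diverse = "ACDEFGHIKLMNPQRSTVWY"
    repeat = "A" * 20
    for name, seq in [("diverse", diverse), ("repeat", repeat), ("mixed", diverse + repeat)]:
        print(name, round(low_complexity_fraction(seq), 3))
```

Under these assumptions, a protein would be called low complexity when the returned fraction exceeds some cutoff; the cutoff actually used to separate the two groups is not given in the abstract.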

Publication types

Comparative Study
Evaluation Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Amino Acid Sequence
Animals
Bacteria / genetics
Computational Biology
Crenarchaeota / genetics
Databases, Protein
Genome
Molecular Sequence Data
Proteins / classification
Repetitive Sequences, Amino Acid
Sequence Analysis, Protein / methods*

Substances

Proteins