The enzymatic nature of an anonymous protein sequence cannot reliably be inferred from superfamily level structural information alone

Protein Sci. 2015 May;24(5):643-50. doi: 10.1002/pro.2635. Epub 2015 Jan 28.

Abstract

Because the largest fraction of any proteome does not carry out enzymatic functions, and in order to leverage 3D structural data for annotating ever larger volumes of sequence data, we assessed the strength of the link between coarse-grained structural data (i.e., at the homologous superfamily level) and the enzymatic versus non-enzymatic nature of protein sequences. To probe this relationship, we used 41 phylogenetically diverse genomes (spanning 11 distinct phyla) recently sequenced within the GEBA initiative, for which we integrated structural information, as defined by CATH, with enzyme-level information, as defined by Enzyme Commission (EC) numbers. This analysis revealed that only a very small fraction (about 1%) of the domain sequences occurring in the analyzed genomes is associated with homologous superfamilies strongly indicative of enzymatic function. Resorting to less stringent criteria to define enzyme-biased versus non-enzyme-biased structural classes, or excluding highly prevalent folds from the analysis, had only a modest effect on this proportion. Thus, the low genomic coverage by structurally anchored protein domains strongly associated with catalytic activities indicates that, on its own, coarse-grained structural information has rather limited power to infer the general property of being an enzyme.
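
The coverage figure reported in the abstract can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function name `enzyme_biased_coverage`, the `threshold` and `min_members` parameters, their default values, and the toy superfamily labels are all hypothetical, chosen only to show the shape of the calculation. Each domain sequence is assumed to arrive as a (superfamily, has EC number) pair; a superfamily counts as enzyme-biased when the share of its domains carrying an EC number reaches the threshold, and coverage is the share of all domains that fall in such superfamilies.

```python
from collections import defaultdict


def enzyme_biased_coverage(domains, threshold=0.95, min_members=1):
    """Return (enzyme-biased superfamilies, fraction of domains they cover).

    domains: iterable of (superfamily_id, has_ec) pairs, one per domain sequence.
    threshold, min_members: hypothetical cut-offs, not values from the paper.
    """
    totals = defaultdict(int)   # domains per superfamily
    enzymes = defaultdict(int)  # domains with an EC number per superfamily
    for superfamily, has_ec in domains:
        totals[superfamily] += 1
        if has_ec:
            enzymes[superfamily] += 1

    # A superfamily is enzyme-biased if enough of its domains carry an EC number.
    biased = {
        sf
        for sf, n in totals.items()
        if n >= min_members and enzymes[sf] / n >= threshold
    }

    covered = sum(n for sf, n in totals.items() if sf in biased)
    total = sum(totals.values())
    return biased, (covered / total if total else 0.0)


# Toy data with made-up superfamily labels (not real CATH identifiers).
toy_domains = (
    [("SF_A", True)] * 2
    + [("SF_B", True), ("SF_B", False), ("SF_B", False)]
    + [("SF_C", False)] * 5
)

strict_biased, strict_coverage = enzyme_biased_coverage(toy_domains)
loose_biased, loose_coverage = enzyme_biased_coverage(toy_domains, threshold=0.3)
print(sorted(strict_biased), strict_coverage)
print(sorted(loose_biased), loose_coverage)
```

Lowering `threshold` mimics the less stringent criteria mentioned in the abstract. In this toy data set the covered share rises from 0.2 to 0.5 when the cut-off is dropped to 0.3, which says nothing about the magnitude of the effect in real genomes.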

Keywords: CATH; GEBA; enzyme genomics; enzymes; fold innovability; fold plasticity; genome and metagenome annotation; homologous superfamily; protein function; protein structure; structural genomics; structure-function relationship.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence / genetics*
  • Archaea / chemistry
  • Archaea / genetics
  • Bacteria / chemistry
  • Bacteria / genetics
  • Genome, Archaeal
  • Genome, Bacterial
  • Protein Conformation*
  • Protein Folding
  • Protein Structure, Tertiary / genetics
  • Sequence Homology, Amino Acid
  • Structure-Activity Relationship*