GFam: a platform for automatic annotation of gene families

Nucleic Acids Res. 2012 Oct;40(19):e152. doi: 10.1093/nar/gks631. Epub 2012 Jul 11.

Abstract

We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and propagate functional annotation seamlessly across periodic genome updates. GFam takes a hybrid approach. It first uses a greedy algorithm to chain component domains from the InterPro annotation provided by its 12 member resources, then applies a sequence-based connected component analysis to un-annotated sequence regions. Together these steps yield a consensus domain architecture for each sequence, and families are subsequently generated from common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing the annotation provided by the different InterPro data sources, which offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
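
To make the domain-chaining and family-building idea concrete, the following is a minimal Python sketch. It is not GFam's actual implementation. It assumes a simple "longest hit first, reject overlapping hits" greedy rule, and the names `Hit`, `overlap`, `greedy_architecture`, `build_families` and the `max_overlap` parameter are hypothetical. The connected component analysis of un-annotated regions is not covered here.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One domain assignment on a sequence (hypothetical record type)."""
    domain: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlap(a, b):
    """Number of residues shared by two hits (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_architecture(hits, max_overlap=0):
    """Greedily pick compatible hits and return the ordered domain labels.

    Assumption: longer hits are considered first, and a hit is kept only
    if it overlaps every already-chosen hit by at most `max_overlap`
    residues. The real selection criteria may differ.
    """
    chosen = []
    by_length = sorted(hits, key=lambda h: (-(h.end - h.start + 1), h.start))
    for hit in by_length:
        if all(overlap(hit, kept) <= max_overlap for kept in chosen):
            chosen.append(hit)
    chosen.sort(key=lambda h: h.start)  # N- to C-terminal order
    return tuple(h.domain for h in chosen)


def build_families(hits_by_sequence):
    """Group sequence IDs that share the same domain architecture."""
    families = defaultdict(list)
    for seq_id, hits in hits_by_sequence.items():
        families[greedy_architecture(hits)].append(seq_id)
    return dict(families)


if __name__ == "__main__":
    # Toy data with made-up domain labels and coordinates.
    toy = {
        "seqA": [Hit("D1", 10, 250), Hit("D2", 12, 248), Hit("D3", 300, 400)],
        "seqB": [Hit("D1", 5, 240), Hit("D3", 290, 380)],
        "seqC": [Hit("D3", 1, 90)],
    }
    print(build_families(toy))
```

On the toy data, `seqA` and `seqB` both reduce to the architecture `("D1", "D3")` and fall into one family, because the shorter, overlapping `D2` hit on `seqA` is rejected. `seqC` forms its own family with the architecture `("D3",)`.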

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Animals
  • Arabidopsis Proteins / chemistry
  • Arabidopsis Proteins / classification
  • Arabidopsis Proteins / genetics
  • Consensus Sequence
  • Genomics / methods
  • Mice
  • Molecular Sequence Annotation*
  • Multigene Family*
  • Protein Structure, Tertiary* / genetics
  • Proteins / chemistry
  • Proteins / classification*
  • Proteins / genetics
  • Sequence Analysis, Protein
  • Software*

Substances

  • Arabidopsis Proteins
  • Proteins