Protein length distribution is remarkably uniform across the tree of life

Genome Biol. 2023 Jun 8;24(1):135. doi: 10.1186/s13059-023-02973-2.

Abstract

Background: In every living species, the function of a protein depends on how its structural domains are organized, and the length of a protein directly reflects this organization. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species, but it has so far received little study.

Results: Here we evaluate this diversity by comparing protein length distribution across 2326 species (1688 bacteria, 153 archaea, and 485 eukaryotes). We find that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller.
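The comparison described above can be illustrated with a minimal sketch. It assumes each proteome is available as FASTA-formatted text, summarizes protein lengths per species, and then measures cross-species variation with a coefficient of variation. The abstract does not state which statistic the authors used, so the coefficient of variation, the function names, and the two toy proteomes are all illustrative assumptions, not the paper's method or data.

```python
from statistics import mean, median, pstdev


def protein_lengths(fasta_text):
    """Return the length of every sequence in a FASTA-formatted string."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Drop a trailing '*' (stop-codon marker used by some annotation tools).
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed statistic)."""
    return pstdev(values) / mean(values)


# Toy proteomes, invented for illustration only (not data from the paper).
toy_proteomes = {
    "species_a": ">p1\nMKT\nAAL\n>p2\nMKTAYIAKQR\n",
    "species_b": ">q1\nMSTNPKPQ\n>q2\nMAAAL*\n",
}

# Per-species summary of the protein length distribution.
summaries = {}
for name, fasta_text in toy_proteomes.items():
    lengths = protein_lengths(fasta_text)
    summaries[name] = {
        "n_proteins": len(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
    }

# Variation of the per-species mean length across species.
cv_mean_length = coefficient_of_variation(
    [s["mean_length"] for s in summaries.values()]
)
print(summaries)
print(f"CV of mean protein length across species: {cv_mean_length:.3f}")
```

Applying the same `coefficient_of_variation` helper to other per-species features (for example genome size or GC content) would give directly comparable numbers, which is the kind of comparison the Results paragraph reports.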

Conclusions: These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution across living species is more uniform than previously thought. We also provide evidence for a universal selection on protein length, yet its mechanism and fitness effect remain intriguing open questions.
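One way such a quality metric might look is sketched below. This is a hypothetical illustration, not the metric proposed by the authors: it flags species whose median protein length lies far from the cross-species median, measured in units of the median absolute deviation (MAD). The function name `flag_atypical`, the threshold of 3, and the example values are assumptions made for this sketch.

```python
from statistics import median


def flag_atypical(median_lengths, threshold=3.0):
    """Return species whose median protein length is an outlier.

    median_lengths: dict mapping species name -> median protein length.
    A species is flagged when |value - centre| / MAD exceeds `threshold`
    (hypothetical rule, chosen only for illustration).
    """
    values = list(median_lengths.values())
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        # No spread to measure against, so nothing can be scored.
        return []
    return [
        name
        for name, value in median_lengths.items()
        if abs(value - centre) / mad > threshold
    ]


# Invented example values (not taken from the paper).
example = {"sp1": 270, "sp2": 280, "sp3": 275, "sp4": 265, "sp5": 120}
flagged = flag_atypical(example)
print(flagged)  # ['sp5']
```

A flagged proteome would be a candidate for checking the gene annotation, in line with the observation that most atypical length distributions appear to stem from annotation artifacts.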

Keywords: Comparative genomics; Genome annotation; Genome evolution; Protein length.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Archaea
  • Bacteria
  • Eukaryota
  • Molecular Sequence Annotation* / methods
  • Proteins* / chemistry
  • Proteins* / classification
  • Proteome
  • Sequence Analysis, Protein* / methods

Substances

  • Proteins
  • Proteome