Investigating Correlation between Protein Sequence Similarity and Semantic Similarity Using Gene Ontology Annotations

IEEE/ACM Trans Comput Biol Bioinform. 2018 May-Jun;15(3):905-912. doi: 10.1109/TCBB.2017.2695542. Epub 2017 Apr 18.

Abstract

Sequence similarity is a commonly used measure for comparing proteins. With the increasing use of ontologies, semantic (functional) similarity is gaining importance. The correlation between these two measures has been applied in the evaluation of new semantic similarity methods and in protein function prediction. In this research, we investigate the relationship between the two similarity measures. The results suggest the absence of a strong correlation between sequence and semantic similarity. A large number of proteins have low sequence similarity and high semantic similarity. We observe that Pearson's correlation coefficient is not sufficient to explain the nature of this relationship. Interestingly, term semantic similarity values strictly between 0 and 1 do not seem to play a role in improving the correlation. That is, the correlation coefficient depends only on the number of GO terms shared by the proteins under comparison, and the semantic similarity measurement method does not influence it. Semantic similarity and sequence similarity behave distinctly. These findings have significant implications for future work on protein comparison and will help in better understanding the semantic similarity between proteins.
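The kind of comparison described in the abstract can be illustrated with a minimal sketch. It assumes a simple set-overlap (Jaccard) score over shared GO terms as a stand-in for semantic similarity, which is not necessarily any of the methods evaluated in the paper. The function names `pearson` and `go_overlap`, the GO identifiers used as sample annotations, and all numeric values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def go_overlap(terms_a, terms_b):
    """Fraction of GO terms shared by two proteins (Jaccard index).

    Assumption: a crude proxy for semantic similarity that depends only
    on the number of common GO terms, not on the ontology structure.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical protein pairs: (sequence similarity, GO terms A, GO terms B).
# The identifiers are illustrative annotations, not data from the paper.
pairs = [
    (0.92, {"GO:0005524", "GO:0004672"}, {"GO:0005524", "GO:0004672"}),
    (0.15, {"GO:0005524", "GO:0004672"}, {"GO:0005524", "GO:0004672"}),
    (0.20, {"GO:0005515"}, {"GO:0005515"}),
    (0.10, {"GO:0005524"}, {"GO:0005515"}),
    (0.85, {"GO:0005524", "GO:0005515"}, {"GO:0005524"}),
]

sequence_sim = [p[0] for p in pairs]
semantic_sim = [go_overlap(p[1], p[2]) for p in pairs]

print("semantic similarity per pair:", semantic_sim)
print("Pearson r:", pearson(sequence_sim, semantic_sim))
```

In this toy data, two pairs with low sequence similarity share all of their GO terms, the situation the abstract describes as common. A single coefficient summarises such a mix of pairs poorly, so inspecting the joint distribution of the two scores is more informative than the coefficient alone.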

MeSH terms

  • Amino Acid Sequence
  • Animals
  • Computational Biology / methods*
  • Databases, Protein
  • Gene Ontology*
  • Humans
  • Mice
  • Molecular Sequence Annotation / methods*
  • Proteins / chemistry*
  • Proteins / genetics
  • Semantics*
  • Sequence Alignment
  • Sequence Analysis, Protein

Substances

  • Proteins