Identifying Protein Subcellular Locations With Embeddings-Based node2loc

IEEE/ACM Trans Comput Biol Bioinform. 2022 Mar-Apr;19(2):666-675. doi: 10.1109/TCBB.2021.3080386. Epub 2022 Apr 1.

Abstract

Identifying protein subcellular locations is an important topic in protein function prediction. Interacting proteins may share similar locations. Thus, it is imperative to infer protein subcellular locations by taking protein-protein interactions (PPIs) into account. In this study, we present a network embedding-based method, node2loc, to identify protein subcellular locations. node2loc first learns distributed embeddings of proteins in a protein-protein interaction (PPI) network using node2vec. The learned embeddings are then fed into a recurrent neural network (RNN). To resolve the severe class imbalance among subcellular locations, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to artificially synthesize proteins for minority classes. node2loc is evaluated on our constructed human benchmark dataset with 16 subcellular locations and yields a Matthews correlation coefficient (MCC) value of 0.800, which is superior to baseline methods. In addition, node2loc yields better performance on a yeast benchmark dataset with 17 locations. The results demonstrate that the representations learned from a PPI network have a certain discriminative ability for classifying protein subcellular locations. However, node2loc is a transductive method: it only works for proteins connected in a PPI network, and it needs to be retrained for new proteins. In addition, the PPI network needs to be annotated to some extent with location information. node2loc is freely available at https://github.com/xypan1232/node2loc.
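
To illustrate the oversampling step mentioned in the abstract, the following is a minimal sketch of SMOTE-style interpolation, not the authors' implementation. It assumes each protein is represented by a fixed-length embedding vector. The function name `smote_like_oversample`, the parameters `k` and `seed`, and the toy 2-D vectors are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority-class
    point and one of its k nearest minority-class neighbours (SMOTE-style).

    Hypothetical helper for illustration; requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen point (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the segment between base and the chosen neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D "embeddings" for a rare location class (made-up values, not real data).
minority_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority_embeddings, n_new=6)
```

Because every synthetic vector lies on a line segment between two existing minority-class vectors, the new samples stay inside the region already occupied by that class rather than duplicating existing points exactly.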

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Humans
  • Neural Networks, Computer*
  • Protein Interaction Mapping* / methods
  • Proteins / metabolism
  • Saccharomyces cerevisiae / metabolism

Substances

  • Proteins