Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree

PLoS One. 2017 Aug 8;12(8):e0181426. doi: 10.1371/journal.pone.0181426. eCollection 2017.

Abstract

A number of computational approaches have been developed to predict protein interactions effectively and accurately. However, most of these methods perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhood information) are not available. In the present work, we propose a method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids. A protein sequence is encoded at multiple scales by seven amino acid properties, using both their qualitative and quantitative descriptions. Five kinds of protein descriptors (frequency, composition, transformation, distribution and auto covariance) are extracted from these encodings to represent each protein sequence. The resulting feature representation, consisting of 347 dimensions, captures not only the compositional and positional information of the amino acids in the sequence but also their statistical significance. Based on this feature representation, the gradient boosting decision tree algorithm is used to predict the protein interaction class. When the proposed method is tested on the PPI data of S. cerevisiae, it achieves a prediction accuracy of 95.28% with a Matthews correlation coefficient of 90.68%. Compared with state-of-the-art methods on the H. pylori and human datasets, the accuracies are raised to 89.27% and 98.00%, respectively. Extensive experiments are also performed on a crossover protein-protein interaction network, and the prediction accuracies are very promising. Because of the learning capabilities of the gradient boosting decision tree and the multi-scale feature representation scheme, the proposed method might be a useful tool for future proteomics studies.
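To make the descriptor families named in the abstract more concrete, the following is a minimal sketch of three of them (composition, transformation between classes, and auto covariance) for a single amino acid property. It is not the authors' implementation. The three-class hydrophobicity grouping, the function names, and the normalizations are assumptions chosen for illustration; the abstract does not list the seven properties, their class boundaries, or the exact formulas.

```python
from collections import Counter

# Hypothetical three-class hydrophobicity grouping (an assumption for
# illustration; the abstract does not specify the properties or classes).
GROUPS = {
    "1": "RKEDQN",   # polar
    "2": "GASTPHY",  # neutral
    "3": "CLVIMFW",  # hydrophobic
}
AA_TO_CLASS = {aa: c for c, aas in GROUPS.items() for aa in aas}


def encode(seq):
    """Qualitative encoding: map each residue to its class label."""
    return "".join(AA_TO_CLASS[aa] for aa in seq)


def composition(enc):
    """Fraction of residues falling into each of the three classes."""
    counts = Counter(enc)
    return [counts[c] / len(enc) for c in "123"]


def transition(enc):
    """Fraction of adjacent pairs that switch between two different classes,
    counted without regard to direction (1-2, 1-3, 2-3)."""
    pairs = Counter(frozenset(p) for p in zip(enc, enc[1:]) if p[0] != p[1])
    n = len(enc) - 1
    return [pairs[frozenset(p)] / n for p in ("12", "13", "23")]


def auto_covariance(values, lag):
    """Auto covariance of a quantitative property profile at a given lag."""
    n = len(values)
    mean = sum(values) / n
    total = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return total / (n - lag)


enc = encode("RKGACL")          # a toy six-residue sequence
comp = composition(enc)
trans = transition(enc)
ac1 = auto_covariance([1.0, 2.0, 3.0, 4.0], lag=1)
```

Repeating such descriptors across several properties and concatenating the results for both proteins of a pair would give a fixed-length vector that a gradient boosting classifier could consume; how the paper arrives at exactly 347 dimensions is not stated in the abstract.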

Publication types

  • Evaluation Study

MeSH terms

  • Amino Acid Sequence*
  • Bacterial Proteins / genetics
  • Bacterial Proteins / metabolism
  • Computational Biology
  • Datasets as Topic
  • Decision Trees*
  • Helicobacter pylori
  • Humans
  • Protein Interaction Mapping / methods*
  • Saccharomyces cerevisiae
  • Saccharomyces cerevisiae Proteins / genetics
  • Saccharomyces cerevisiae Proteins / metabolism
  • Wnt Proteins / genetics
  • Wnt Proteins / metabolism

Substances

  • Bacterial Proteins
  • Saccharomyces cerevisiae Proteins
  • Wnt Proteins

Grants and funding

This work was supported by: National Natural Science Foundation of China (61930007), URL: http://www.nsfc.gov.cn/ (XG); National High Technology Research and Development Program of China (863 Program) (2015BA3005), URL: http://www.most.gov.cn/eng/programmes1/ (XG); and National 973 Program (2013CB32930X), URL: http://www.most.gov.cn/eng/programmes1/200610/t20061009_36223.htm (XG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.