A novel feature extraction scheme with ensemble coding for protein-protein interaction prediction

Int J Mol Sci. 2014 Jul 18;15(7):12731-49. doi: 10.3390/ijms150712731.

Abstract

Protein-protein interactions (PPIs) play key roles in most cellular processes, such as cell metabolism, immune response, endocrine function, DNA replication, and transcription regulation. PPI prediction is one of the most challenging problems in functional genomics. Although PPI data have been increasing because of the development of high-throughput technologies and computational methods, many problems are still far from being solved. In this study, a novel predictor was designed by using the Random Forest (RF) algorithm with the ensemble coding (EC) method. To reduce computational time, a feature selection method (DX) was adopted to rank the features and search for the optimal feature combination. The resulting DXEC method integrates many features and physicochemical/biochemical properties to predict PPIs. On the Gold Yeast dataset, the DXEC method achieves 67.2% overall precision, 80.74% recall, and 70.67% accuracy. On the Silver Yeast dataset, it achieves 76.93% precision, 77.98% recall, and 77.27% accuracy. On the human dataset, the prediction accuracy of the DXEC-RF method reaches 80%. We extended the experiment to bigger and more realistic datasets, on which the method maintains 50% recall on the Yeast All dataset and 80% recall on the Human All dataset. These results show that the DXEC method is suitable for performing PPI prediction. The prediction service of the DXEC-RF classifier is available at http://ailab.ahu.edu.cn:8087/DXECPPI/index.jsp.
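
The precision, recall, and accuracy figures quoted above are standard binary-classification metrics. As a reading aid, the following minimal Python sketch shows how the three values are derived from predicted and true labels. It is not the authors' evaluation code: the function name `binary_metrics` and the toy labels are hypothetical, and the sketch assumes that label 1 marks an interacting protein pair and label 0 a non-interacting pair.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, accuracy) for 0/1 labels.

    Assumption for this sketch: 1 = interacting pair, 0 = non-interacting pair.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard against empty denominators (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

In this toy case the predictor recovers three of the four true interactions but also raises two false alarms, so recall is higher than precision, the same pattern as the Gold Yeast result reported above.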

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Humans
  • Protein Binding
  • Proteins / chemistry*
  • Proteins / metabolism
  • Sensitivity and Specificity
  • Sequence Analysis, Protein / methods
  • Software*
  • Yeasts / chemistry
  • Yeasts / metabolism

Substances

  • Proteins