An ensemble method with hybrid features to identify extracellular matrix proteins

PLoS One. 2015 Feb 13;10(2):e0117804. doi: 10.1371/journal.pone.0117804. eCollection 2015.

Abstract

The extracellular matrix (ECM) is a dynamic composite of secreted proteins that play important roles in numerous biological processes such as tissue morphogenesis, differentiation, and homeostasis. Furthermore, various diseases are caused by the dysfunction of ECM proteins. Therefore, identifying these ECM proteins may assist in understanding related biological processes and in drug development. Because the training dataset is seriously imbalanced, a Random Forest-based ensemble method with hybrid features is developed in this paper to identify ECM proteins. The hybrid features incorporate sequence composition, physicochemical properties, and evolutionary and structural information. The Information Gain Ratio and Incremental Feature Selection (IGR-IFS) methods are adopted to select the optimal features. The resulting predictor, termed IECMP (Identify ECM Proteins), achieves a balanced accuracy of 86.4% using 10-fold cross-validation on the training dataset, which is much higher than the results obtained by other methods (ECMPRED: 71.0%, ECMPP: 77.8%). Moreover, when tested on a common independent dataset, our method also achieves significantly improved performance over ECMPP and ECMPRED. These results indicate that IECMP is an effective method for ECM protein prediction, with a more balanced prediction capability for positive and negative samples. It is anticipated that the proposed method will provide significant information to help decipher the molecular mechanisms of ECM-related biological processes and to discover candidate drug targets. For public access, we have developed a user-friendly web server for ECM protein identification that is freely accessible at http://iecmp.weka.cc.
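
The abstract reports balanced accuracy rather than plain accuracy because the dataset is imbalanced. As a minimal sketch, assuming the usual definition of balanced accuracy as the mean of sensitivity and specificity (the abstract does not state the formula), the metric could be computed as follows. The function name and the toy labels are illustrative only and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = ECM, 0 = non-ECM).

    Assumed definition; the abstract does not spell out the formula.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives recovered
    return (sensitivity + specificity) / 2


# Toy imbalanced set: 10 positives, 90 negatives (hypothetical numbers).
y_true = [1] * 10 + [0] * 90
all_negative = [0] * 100  # a predictor that never predicts ECM

plain_accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
print(plain_accuracy)                           # 0.9
print(balanced_accuracy(y_true, all_negative))  # 0.5
```

In this toy case a predictor that ignores the minority class still reaches 90% plain accuracy but only 50% balanced accuracy, which is why the latter is the more informative figure when positives are rare.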

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Datasets as Topic
  • Extracellular Matrix Proteins / metabolism*
  • Reproducibility of Results
  • Support Vector Machine*
  • Web Browser
  • Workflow

Substances

  • Extracellular Matrix Proteins

Grants and funding

This work was funded by the National Natural Science Foundation of China (http://www.nsfc.gov.cn; No. 61174044, 61473335, and 61174218). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.