A multiple information fusion method for predicting subcellular locations of two different types of bacterial protein simultaneously

Biosystems. 2016 Jan:139:37-45. doi: 10.1016/j.biosystems.2015.12.002. Epub 2015 Dec 24.

Abstract

Predicting the subcellular localization of bacterial proteins is an important task in bioinformatics, with applications in drug design and other areas. Many computational tools for predicting protein subcellular localization have been developed in recent decades. In this study, we first introduce three kinds of protein sequence encoding schemes: physicochemical-based, evolutionary-based, and GO-based. The original and consensus sequences are combined with physicochemical properties. Information from elements in different rows and columns of the position-specific scoring matrix is considered simultaneously to capture more of the essential information. Computational methods based on gene ontology (GO) have been demonstrated to be superior to methods based on other features. Principal component analysis (PCA) is then applied for feature selection, and the reduced vectors are input to a support vector machine (SVM) to predict protein subcellular localization. With the jackknife test, the proposed method achieves prediction accuracies of 98.28% and 97.87% on stringent Gram-positive (Gpos) and Gram-negative (Gneg) datasets, respectively. Finally, we calculate the "absolute true overall accuracy (ATOA)", which is stricter than overall accuracy. The ATOA obtained by the proposed method reaches 97.32% and 93.06% for Gpos and Gneg, respectively. Judged by both the rationality of the testing procedure and the success rates of the test results, the current method can improve the prediction quality of protein subcellular localization.
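
The abstract does not define ATOA, so the following is only a minimal sketch of why such a metric is stricter than overall accuracy. It assumes the definition commonly used for proteins with more than one location: a protein counts as correct only if its predicted set of locations matches the true set exactly. "Overall accuracy" is assumed here to count each correctly predicted location separately. The function names and the four example proteins are hypothetical and are not taken from the paper.

```python
def absolute_true_overall_accuracy(true_sets, predicted_sets):
    """Fraction of proteins whose predicted location set equals the true set.

    Assumed definition: any missing or extra location makes the whole
    protein count as wrong.
    """
    if len(true_sets) != len(predicted_sets) or not true_sets:
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return exact / len(true_sets)


def overall_locative_accuracy(true_sets, predicted_sets):
    """Fraction of true (protein, location) pairs that were predicted.

    Assumed definition: each true location is scored on its own, so a
    partly correct prediction still earns partial credit.
    """
    if len(true_sets) != len(predicted_sets) or not true_sets:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(len(set(t) & set(p)) for t, p in zip(true_sets, predicted_sets))
    total = sum(len(set(t)) for t in true_sets)
    return hits / total


# Hypothetical example: four proteins, one of which has two locations.
true_locations = [
    {"cytoplasm"},
    {"cell membrane"},
    {"cell membrane", "periplasm"},
    {"extracellular"},
]
predicted_locations = [
    {"cytoplasm"},
    {"cell membrane"},
    {"cell membrane"},  # one of the two true locations is missed
    {"extracellular"},
]

atoa = absolute_true_overall_accuracy(true_locations, predicted_locations)
overall = overall_locative_accuracy(true_locations, predicted_locations)
print(f"ATOA: {atoa:.2f}, overall locative accuracy: {overall:.2f}")
```

In this toy case the partly correct third protein earns credit toward overall accuracy but none toward ATOA, which is consistent with the lower ATOA values reported for both datasets.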

Keywords: Gene ontology; Physicochemical properties; Position-specific score matrix; Principal component analysis; Support vector machine.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bacterial Proteins / metabolism*
  • Computational Biology*
  • Intracellular Space / metabolism*
  • Models, Biological
  • Principal Component Analysis
  • Support Vector Machine

Substances

  • Bacterial Proteins