An ensemble method for predicting subnuclear localizations from primary protein structures

PLoS One. 2013;8(2):e57225. doi: 10.1371/journal.pone.0057225. Epub 2013 Feb 27.

Abstract

Background: Predicting protein subnuclear localization is a challenging problem. Previous approaches based on non-sequence information, including Gene Ontology annotations and kernel fusion, each have limitations. The aim of this work is twofold: first, to propose a novel individual feature extraction method; second, to develop an ensemble method that improves prediction performance by using comprehensive information, represented as a high-dimensional feature vector obtained from 11 feature extraction methods.
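As a rough illustration of how several sequence-based extractors can be combined into one high-dimensional feature vector, the sketch below concatenates two simple descriptors. It is not the authors' implementation: the two extractors, the three-class residue grouping, and all function names are assumptions chosen for illustration, and the abstract does not list the 11 methods actually used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative three-class grouping (an assumption, not taken from the paper).
GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "GSTNQYPH",
    "charged": "DEKR",
}


def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def group_composition(seq):
    """Fraction of residues falling in each class of the illustrative grouping."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [sum(counts[a] for a in members) / total for members in GROUPS.values()]


def combined_features(seq, extractors=(aa_composition, group_composition)):
    """Concatenate the outputs of all extractors into one feature vector."""
    vector = []
    for extract in extractors:
        vector.extend(extract(seq))
    return vector


if __name__ == "__main__":
    features = combined_features("ACDK")  # toy sequence, not real data
    print(len(features))  # 20 composition values + 3 group values
```

With more extractors the vector simply grows; each additional descriptor would be one more function in the `extractors` tuple.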

Methodology/principal findings: A novel two-stage multiclass support vector machine is proposed to predict protein subnuclear localizations. It considers only feature extraction methods based on amino acid classifications and physicochemical properties. To speed up the system, an automatic search method for the kernel parameter is used. The prediction performance of our method is evaluated on four datasets: the Lei dataset, a multi-localization dataset, the SNL9 dataset, and a new independent dataset. In leave-one-out cross-validation, the overall prediction accuracy is 75.2% for 6 localizations on the Lei dataset and 72.1% for 9 localizations on the SNL9 dataset; the overall accuracy is 71.7% on the multi-localization dataset and 69.8% on the new independent dataset. Comparisons with existing methods show that our method performs better for both single-localization and multi-localization proteins and achieves more balanced sensitivities and specificities on large-size and small-size subcellular localizations. The overall accuracy improvements are 4.0% and 4.7% for single-localization proteins and 6.5% for multi-localization proteins. The reliability and stability of our classification model are further confirmed by permutation analysis.
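The reported measures (overall accuracy, and per-localization sensitivity and specificity) follow standard definitions, which can be sketched as below. This is a minimal illustration with toy labels; the function names are hypothetical, and the abstract does not state exactly how the authors computed these measures.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of proteins whose predicted localization matches the true one."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def sensitivity_specificity(y_true, y_pred, label):
    """One-vs-rest sensitivity and specificity for a single localization."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    tn = sum(t != label and p != label for t, p in zip(y_true, y_pred))
    # Undefined ratios (no positives or no negatives) are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy labels for illustration only, not results from the paper.
    truth = ["nucleolus", "nucleolus", "chromatin", "speckle"]
    predicted = ["nucleolus", "chromatin", "chromatin", "speckle"]
    print(overall_accuracy(truth, predicted))
    for loc in sorted(set(truth)):
        print(loc, sensitivity_specificity(truth, predicted, loc))
```

Reporting sensitivity and specificity per class matters when class sizes differ, because overall accuracy alone can be dominated by the largest localizations.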

Conclusions: Our method is effective and valuable for predicting protein subnuclear localizations. A web server implementing the proposed method is freely available at http://bioinformatics.awowshop.com/snlpred_page.php.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Cell Nucleus / metabolism*
  • Databases, Protein
  • Models, Molecular
  • Protein Transport
  • Proteins / chemistry*
  • Proteins / metabolism*
  • ROC Curve
  • Reproducibility of Results
  • Sequence Analysis, Protein / methods*
  • Subcellular Fractions / metabolism
  • Support Vector Machine

Substances

  • Proteins

Grants and funding

This project was supported by the Natural Science Foundation of China (grant 11071282), the Chinese Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) (grant IRT1179), the Research Foundation of Education Commission of Hunan Province of China (grant 11A122), the Hunan Provincial Natural Science Foundation of China (grant 10JJ7001), the Science and Technology Planning Project of Hunan Province of China (grant 2011FJ2011), the Lotus Scholars Program of Hunan Province of China, the Aid Program for Science and Technology Innovative Research Team in Higher Educational Institutions of Hunan Province of China, the Australian Research Council (grant DP0559807), and the Hunan Provincial Postgraduate Research and Innovation Project of China (grant CX2010B243). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.