FunSAV: predicting the functional effect of single amino acid variants using a two-stage random forest model

PLoS One. 2012;7(8):e43847. doi: 10.1371/journal.pone.0043847. Epub 2012 Aug 24.

Abstract

Single amino acid variants (SAVs) are the most abundant form of known genetic variation associated with human disease. Successful prediction of the functional impact of SAVs from sequences can thus improve our understanding of why a SAV may be associated with a particular disease. In this work, we constructed a structural dataset containing 679 high-quality protein structures with 2,048 SAVs by collecting human genetic variant data from multiple resources and dividing the variants into two categories: disease-associated and neutral. We built a two-stage random forest (RF) model, termed FunSAV, to predict the functional effect of SAVs by combining sequence, structure and residue-contact network features with additional features that were not explored in previous studies. Importantly, we propose a two-step feature selection procedure to select the most informative features for predicting the disease association of SAVs. In cross-validation experiments on the benchmark dataset, FunSAV achieved an area under the curve (AUC) of 0.882, which is competitive with, and in some cases better than, existing tools including SIFT, SNAP, Polyphen2, PANTHER, nsSNPAnalyzer and PhD-SNP. The source code of FunSAV and the datasets can be downloaded at http://sunflower.kuicr.kyoto-u.ac.jp/sjn/FunSAV.
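
For readers unfamiliar with the AUC metric reported above, the following is a minimal sketch of how an area under the ROC curve can be computed from predicted scores. It is not taken from the FunSAV source code: the function name `roc_auc`, the label encoding (1 for disease-associated, 0 for neutral) and the toy scores are all assumptions made for illustration. It uses the standard pairwise interpretation of AUC, the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: the fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as 0.5."""
    # Assumed encoding: 1 = disease-associated, 0 = neutral.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up data (not from the FunSAV benchmark).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes. The quadratic pairwise loop is fine for small examples; for large datasets a rank-based computation would be preferable.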

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acids / genetics*
  • Amino Acids / metabolism
  • Computational Biology*
  • Genetic Variation*
  • Genome, Human
  • Humans
  • Models, Molecular
  • Protein Conformation
  • Sequence Analysis, Protein / methods*

Substances

  • Amino Acids

Grants and funding

This work was supported by grants from the National Health and Medical Research Council of Australia (NHMRC), the Australian Research Council (ARC), the Japan Society for the Promotion of Science (JSPS), the Hundred Talents Program of the Chinese Academy of Sciences (CAS), the Knowledge Innovation Program of CAS (No. KSCX2-EW-G-8) and the Tianjin Municipal Science & Technology Commission (No. 10ZCKFSY05600). JS is an NHMRC Peter Doherty Fellow and a recipient of the Hundred Talents Program of CAS and of the JSPS Short-term Invitation Fellowship to the Bioinformatics Center, Kyoto University, Japan. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.