Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information

IEEE/ACM Trans Comput Biol Bioinform. 2012 Nov-Dec;9(6):1766-75. doi: 10.1109/TCBB.2012.106.

Abstract

The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions, gene expression, and for guiding drug design. Therefore, a prediction method DNABR (DNA Binding Residues) is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the information about the conservation of physicochemical properties of the amino acids, and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids while the second reflects the dependency effect of amino acids with regards to polarity charge and hydrophobic properties in the protein sequences. Those two features and an orthogonal binary vector which reflect the characteristics of 20 types of amino acids are used to build the DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for Matthew’s correlation coefficient (MCC) and 93.04 percent overall accuracy (ACC) with a68.47 percent sensitivity (SE) and 98.16 percent specificity (SP), respectively. The comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has an excellent prediction performance for detecting binding residues in putative DNA-binding protein. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acids
  • Computational Biology / methods*
  • DNA / chemistry*
  • DNA / metabolism*
  • DNA-Binding Proteins / chemistry*
  • DNA-Binding Proteins / metabolism*
  • Databases, Protein
  • Humans
  • Internet
  • Models, Molecular
  • Models, Statistical*
  • Protein Binding
  • ROC Curve
  • Sequence Analysis, Protein / methods*
  • Support Vector Machine

Substances

  • Amino Acids
  • DNA-Binding Proteins
  • DNA