Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences

BMC Bioinformatics. 2017 Jun 12;18(1):300. doi: 10.1186/s12859-017-1715-8.

Abstract

Background: DNA-binding proteins perform important functions in many biological activities. They can interact with single-stranded DNA (ssDNA) or double-stranded DNA (dsDNA), and can accordingly be categorized as single-stranded DNA-binding proteins (SSBs) or double-stranded DNA-binding proteins (DSBs). Identifying DNA-binding proteins from amino acid sequences can help to annotate protein functions and to understand binding specificity. In this study, we systematically consider a variety of schemes for representing protein sequences: overall amino acid composition (OAAC) features, dipeptide compositions, position-specific scoring matrix (PSSM) profiles and split amino acid composition (SAA). We then adopt support vector machine (SVM) and random forest (RF) classification models to distinguish SSBs from DSBs.
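The composition-based representations named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the handling of non-standard residues, and the terminal segment lengths used for the split composition are all assumptions made for illustration, since the abstract does not specify them. PSSM profiles are omitted because they require an external alignment search.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Return 20 residue fractions; non-standard residues are ignored (assumption)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 fractions of adjacent residue pairs.

    Pairs containing a non-standard residue are skipped (assumption).
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def split_composition(seq, n_term=25, c_term=25):
    """Concatenate compositions of N-terminal, middle and C-terminal segments.

    The segment lengths are illustrative defaults, not values from the paper.
    """
    head = seq[:n_term]
    middle = seq[n_term:max(len(seq) - c_term, n_term)]
    tail = seq[max(len(seq) - c_term, 0):]
    features = []
    for segment in (head, middle, tail):
        features.extend(amino_acid_composition(segment))
    return features
```

Concatenating these vectors gives one fixed-length feature vector per protein, which is the form of input that classifiers such as SVM and RF expect.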

Results: Our results suggest that some sequence features can significantly differentiate DSBs from SSBs. Evaluated by 10-fold cross-validation on the benchmark datasets, our prediction method achieves an accuracy of 88.7% and an AUC (area under the curve) of 0.919. Moreover, our method performs well in independent testing.
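For readers unfamiliar with the evaluation protocol, the sketch below shows how accuracy, AUC and a 10-fold split can be computed in plain Python. It assumes that AUC refers to the area under the ROC curve and that labels are encoded as 1 for SSB and 0 for DSB; both are assumptions, as are all function names. It does not reproduce the reported figures.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are assumed to be 1 (positive) or 0 (negative).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

In each of the k rounds a classifier is trained on the `train` indices and scored on the held-out `test` indices, so every sample is used for testing exactly once.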

Conclusions: Using various sequence-derived features, we propose a novel method that accurately distinguishes DSBs from SSBs. The method also explores novel features, which could help to uncover the binding specificity of DNA-binding proteins.

Keywords: Binding specificity; DSBs (Double-stranded DNA-binding proteins); Protein sequence; SSBs (Single-stranded DNA-binding proteins).

MeSH terms

  • Amino Acid Sequence
  • Computational Biology / methods*
  • DNA / metabolism*
  • DNA, Single-Stranded / metabolism*
  • DNA-Binding Proteins* / chemistry
  • DNA-Binding Proteins* / genetics
  • DNA-Binding Proteins* / metabolism
  • Protein Binding
  • Sequence Analysis, Protein / methods*
  • Support Vector Machine

Substances

  • DNA, Single-Stranded
  • DNA-Binding Proteins
  • DNA