PSFM-DBT: Identifying DNA-Binding Proteins by Combing Position Specific Frequency Matrix and Distance-Bigram Transformation

Int J Mol Sci. 2017 Aug 25;18(9):1856. doi: 10.3390/ijms18091856.

Abstract

DNA-binding proteins play crucial roles in various biological processes, such as DNA replication and repair, transcriptional regulation, and many other DNA-associated activities. Experimental techniques for identifying DNA-binding proteins are both time-consuming and expensive, so effective methods that identify these proteins from protein sequences alone are in high demand. The key to sequence-based methods is an effective representation of protein sequences. Previous studies have reported that evolutionary information is crucial for DNA-binding protein identification. In this study, we employed four methods to extract evolutionary information from the Position Specific Frequency Matrix (PSFM): Residue Probing Transformation (RPT), Evolutionary Difference Transformation (EDT), Distance-Bigram Transformation (DBT), and Trigram Transformation (TT). Each method converts a PSFM into a fixed-length feature vector, and each was combined with Support Vector Machines (SVMs) to construct four predictors for identifying these proteins: PSFM-RPT, PSFM-EDT, PSFM-DBT, and PSFM-TT. Experimental results on the widely used benchmark dataset PDB1075 and the independent dataset PDB186 showed that all four methods achieved state-of-the-art performance, and that PSFM-DBT outperformed other existing methods in this field. For practical applications, a user-friendly webserver for PSFM-DBT was established, which is available at http://bioinformatics.hitsz.edu.cn/PSFM-DBT/.
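
The abstract does not give the DBT formula, so the following is only a minimal sketch of how a variable-length PSFM might be turned into a fixed-length vector. It assumes a commonly used bigram form, in which the feature for residue pair (i, j) at distance d is the sum over positions k of P[k, i] * P[k + d, j]. The function name `distance_bigram_features`, the parameter `max_distance`, and the absence of any length normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def distance_bigram_features(psfm, max_distance=3):
    """Convert an (L x A) frequency matrix into a fixed-length vector.

    Hypothetical sketch: for each distance d in 1..max_distance, compute
    the A x A matrix whose (i, j) entry is sum_k psfm[k, i] * psfm[k + d, j],
    then flatten and concatenate the matrices. The output length is
    max_distance * A * A, independent of the sequence length L.
    """
    psfm = np.asarray(psfm, dtype=float)
    if psfm.ndim != 2:
        raise ValueError("psfm must be a 2-D array (positions x residues)")
    length, _ = psfm.shape
    if not 1 <= max_distance < length:
        raise ValueError("max_distance must be in [1, sequence length - 1]")

    blocks = []
    for d in range(1, max_distance + 1):
        # Rows 0..L-d-1 paired with rows d..L-1; the matrix product sums
        # the pairwise products over all positions k.
        block = psfm[:-d].T @ psfm[d:]
        blocks.append(block.ravel())
    return np.concatenate(blocks)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy "proteins" of different lengths, 20 amino-acid columns each.
    short = rng.random((30, 20))
    long_ = rng.random((75, 20))
    print(distance_bigram_features(short).shape)
    print(distance_bigram_features(long_).shape)
```

Because the output length depends only on `max_distance` and the number of residue columns, proteins of different lengths map to vectors of the same size, which is what a fixed-input classifier such as an SVM requires.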

Keywords: DNA binding protein; PSFM; PSFM-DBT; distance bigram transformation.

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Computational Biology / methods*
  • DNA / chemistry
  • DNA / genetics
  • DNA / metabolism*
  • DNA-Binding Proteins / chemistry
  • DNA-Binding Proteins / metabolism*
  • Models, Molecular
  • Protein Binding
  • Protein Conformation
  • Protein Interaction Domains and Motifs
  • ROC Curve
  • Reproducibility of Results
  • Support Vector Machine
  • Web Browser

Substances

  • DNA-Binding Proteins
  • DNA