Identification of DNA-Binding Proteins Using Mixed Feature Representation Methods

Kaiyang Qu; Ke Han; Song Wu; Guohua Wang; Leyi Wei

doi:10.3390/molecules22101602

Molecules. 2017 Sep 22;22(10):1602. doi: 10.3390/molecules22101602.

Authors

Kaiyang Qu¹, Ke Han², Song Wu³, Guohua Wang⁴, Leyi Wei⁵,⁶

Affiliations

¹ School of Computer Science and Technology, Tianjin University, Tianjin 300350, China. nyqky257248@163.com.
² School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China. hanke@hrbcu.edu.cn.
³ Center of Potential Illness, Qinhuangdao Hospital of Traditional Chinese Medicine, Qinhuangdao 066001, China. suran.tju@gmail.com.
⁴ School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China. ghwang@hit.edu.cn.
⁵ School of Computer Science and Technology, Tianjin University, Tianjin 300350, China. weileyi@tju.edu.cn.
⁶ State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin 300074, China. weileyi@tju.edu.cn.

Abstract

DNA-binding proteins play vital roles in cellular processes such as DNA packaging, replication, transcription, regulation, and other DNA-associated activities. Current prediction methods are mainly based on machine learning, and their accuracy depends largely on the feature extraction method. An efficient feature representation is therefore important for improving classification accuracy. However, existing feature representation methods cannot efficiently distinguish DNA-binding proteins from non-DNA-binding proteins. In this paper, a mixed feature representation method that combines three feature representation methods, namely K-Skip-N-Grams, Information theory, and Sequential and structural features (SSF), is used to represent protein sequences and improve feature representation ability. A support vector machine is used as the classifier. The mixed feature representation method is evaluated using 10-fold cross-validation and a test set. Feature vectors obtained by combining all three feature extraction methods show the best performance in 10-fold cross-validation, both without dimensionality reduction and with dimensionality reduction by max-relevance-max-distance. Moreover, the reduced mixed feature method performs better than the non-reduced one. On the test set, the feature vectors that combine SSF and K-Skip-N-Grams show the best performance. Overall, mixed features outperform single features.
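As an illustration of the kind of sequence feature named above, the following is a minimal sketch of a skip-bigram (K-Skip-2-Gram) frequency vector for a protein sequence. It is not the authors' implementation: the abstract does not give the exact definition used in the paper, so the function name `skip_bigram_features`, the restriction to n = 2, the 20-letter alphabet, and the normalization by the total pair count are all assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def skip_bigram_features(sequence, k=2):
    """Return a 400-dimensional vector of normalized skip-bigram frequencies.

    For every gap from 0 to k, each pair of residues separated by that many
    skipped positions is counted. Counts are divided by the total number of
    counted pairs. Pairs containing non-standard residues are ignored.
    """
    # Map each ordered residue pair ("AA", "AC", ...) to a vector position.
    index = {a + b: i for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}
    counts = [0] * len(index)
    total = 0
    for gap in range(k + 1):
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in index:
                counts[index[pair]] += 1
                total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


features = skip_bigram_features("ACDA", k=1)
```

A vector like this could be concatenated with other feature vectors to form a mixed representation and then passed to a classifier such as a support vector machine; the concatenation and classifier settings are left out here because the abstract does not specify them.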

Keywords: DNA-binding protein; mixed feature representation methods; support vector machine.

MeSH terms

Amino Acid Sequence
Computational Biology / methods
DNA / chemistry
DNA-Binding Proteins / metabolism*
Machine Learning
Support Vector Machine

Substances

DNA-Binding Proteins
DNA