Protein binding hot spots prediction from sequence only by a new ensemble learning method

Shan-Shan Hu; Peng Chen; Bing Wang; Jinyan Li

doi:10.1007/s00726-017-2474-6

Amino Acids. 2017 Oct;49(10):1773-1785. doi: 10.1007/s00726-017-2474-6. Epub 2017 Aug 1.

Authors

Shan-Shan Hu¹ ², Peng Chen³ ⁴ ⁵, Bing Wang⁶, Jinyan Li⁷

Affiliations

¹ School of Computer Science and Technology, Anhui University, Hefei, 230601, Anhui, China.
² Institute of Health Sciences, Anhui University, Hefei, 230601, Anhui, China.
³ School of Computer Science and Technology, Anhui University, Hefei, 230601, Anhui, China. pchen.ustc10@yahoo.com.
⁴ Institute of Health Sciences, Anhui University, Hefei, 230601, Anhui, China. pchen.ustc10@yahoo.com.
⁵ Advanced Analytics Institute and Centre for Health Technologies, University of Technology, Sydney, Broadway, NSW, 2007, Australia. pchen.ustc10@yahoo.com.
⁶ School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan, 243032, Anhui, China.
⁷ Advanced Analytics Institute and Centre for Health Technologies, University of Technology, Sydney, Broadway, NSW, 2007, Australia.

PMID: 28766075
DOI: 10.1007/s00726-017-2474-6

Abstract

Hot spots are core interfacial regions of binding proteins and have been used as targets in drug design. Locating hot spot regions experimentally is costly in both time and money. Recently, in silico computational methods have been widely used for hot spot prediction based on sequence or structure characterization. Because protein structures are not always solved, identifying hot spots from amino acid sequences alone is more useful for real-life applications. This work proposes a new sequence-based model that combines physicochemical features with the relative accessible surface area of amino acid sequences for hot spot prediction. The model consists of 83 classifiers built with the IBk (instance-based k-nearest neighbor) algorithm, where instances are encoded by important properties extracted from a total of 544 properties in the AAindex1 (Amino Acid Index) database. The top-performing classifiers are then selected to form an ensemble by majority voting. The ensemble classifier outperforms state-of-the-art computational methods, yielding an F1 score of 0.80 on the benchmark binding interface database (BID) test set.
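The select-then-vote scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain Euclidean k-nearest-neighbor rule as a stand-in for IBk, uses made-up three-column feature vectors in place of AAindex1-derived encodings, and ranks candidate classifiers by F1 score on a small validation set. All function names, the feature subsets, and the parameters `k` and `top_n` are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training instances."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive (hot spot) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def project(X, cols):
    """Keep only the feature columns listed in cols."""
    return [[row[c] for c in cols] for row in X]

def select_members(train_X, train_y, val_X, val_y, subsets, top_n=3, k=3):
    """Train one kNN per feature subset and keep the top_n by validation F1."""
    scored = []
    for cols in subsets:
        tr = project(train_X, cols)
        preds = [knn_predict(tr, train_y, x, k) for x in project(val_X, cols)]
        scored.append((f1_score(val_y, preds), cols))
    scored.sort(key=lambda s: s[0], reverse=True)  # stable: ties keep input order
    return [cols for _, cols in scored[:top_n]]

def ensemble_predict(train_X, train_y, x, members, k=3):
    """Majority vote over the selected member classifiers."""
    votes = [
        knn_predict(project(train_X, cols), train_y, [x[c] for c in cols], k)
        for cols in members
    ]
    return Counter(votes).most_common(1)[0][0]

# Toy data (assumed): columns 0 and 1 separate the classes, column 2 is noise.
train_X = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.9], [0.85, 0.7, 0.5],
           [0.1, 0.2, 0.2], [0.2, 0.1, 0.8], [0.15, 0.3, 0.4]]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = hot spot, 0 = non-hot spot
val_X = [[0.8, 0.75, 0.3], [0.2, 0.25, 0.7], [0.9, 0.9, 0.6], [0.1, 0.1, 0.1]]
val_y = [1, 0, 1, 0]

subsets = [[0], [1], [2], [0, 1]]  # hypothetical candidate encodings
members = select_members(train_X, train_y, val_X, val_y, subsets, top_n=3)
hot = ensemble_predict(train_X, train_y, [0.88, 0.82, 0.05], members)
cold = ensemble_predict(train_X, train_y, [0.12, 0.18, 0.95], members)
```

Using an odd number of members and an odd `k` avoids tied votes in the two-class case. In this toy setup the classifier built on the noise column scores lowest on validation and is left out of the ensemble.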

Availability: http://www2.ahu.edu.cn/pchen/web/HotspotEC.htm

Keywords: Ensemble system; Hot spot residue; IBk.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Databases, Protein*
Machine Learning*
Models, Molecular*
Sequence Analysis, Protein / methods*
