Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid-Binding Proteins

Wenchuan Wang; Robert Langlois; Marina Langlois; Georgi Z Genchev; Xiaolei Wang; Hui Lu

doi:10.3389/fgene.2019.00729

Front Genet. 2019 Aug 30:10:729. doi: 10.3389/fgene.2019.00729. eCollection 2019.

Authors

Wenchuan Wang¹, Robert Langlois², Marina Langlois², Georgi Z Genchev¹,²,³, Xiaolei Wang¹,⁴, Hui Lu¹,²,⁵

Affiliations

¹ SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.
² Department of Bioengineering and Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States.
³ Bulgarian Institute for Genomics and Precision Medicine, Sofia, Bulgaria.
⁴ Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
⁵ Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai, China.

Abstract

Function annotation efforts provide a foundation for our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-learning methods, which promise to make efficient, de novo predictions of protein function. However, available function annotation exists predominantly for whole proteins rather than for individual residues, of which only a subset is necessary to convey a particular function. This limits discriminative approaches to predicting functions for which there is sufficient residue-level annotation (e.g., identification of DNA-binding proteins) or for which an excellent global representation can be devised. Complete understanding of the various functions of proteins requires discovery and functional annotation at the residue level. Herein, we cast this problem into the setting of multiple-instance learning, which requires knowledge only of the protein's function yet identifies functionally relevant residues and need not rely on homology. We developed a new multiple-instance learning algorithm derived from AdaBoost and benchmarked it on two well-studied protein function prediction tasks: annotating proteins that bind DNA and annotating proteins that bind RNA. This algorithm outperforms certain previous approaches in annotating protein function while identifying functionally relevant residues involved in binding both DNA and RNA, and on one protein-DNA benchmark it achieves near-perfect classification.
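The multiple-instance setting described above can be illustrated with a minimal sketch: each protein is a "bag" of residues ("instances"), only the protein-level label is known, and a protein-level score is aggregated from residue-level scores. The noisy-OR aggregation, the function names, and the residue scores below are assumptions chosen for illustration. This is not the AdaBoost-derived algorithm developed in the paper, whose details are not given in the abstract.

```python
from math import prod


def bag_probability(instance_probs):
    """Noisy-OR aggregation (an assumed, commonly used MIL rule):
    probability that at least one residue in the protein is functional."""
    return 1.0 - prod(1.0 - p for p in instance_probs)


def top_residues(instance_probs, k=2):
    """Indices of the k highest-scoring residues, best first."""
    ranked = sorted(
        range(len(instance_probs)),
        key=lambda i: instance_probs[i],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-residue scores from some residue-level classifier.
binder = [0.05, 0.90, 0.10, 0.80]      # protein labeled as nucleic acid-binding
non_binder = [0.05, 0.10, 0.02, 0.08]  # protein labeled as non-binding

print(bag_probability(binder))      # protein-level score for the binder
print(bag_probability(non_binder))  # protein-level score for the non-binder
print(top_residues(binder))         # candidate functional residues
```

Under this assumed rule, a protein scores high as soon as a few residues score high, so training against protein-level labels alone still pushes the residue-level scorer toward the residues that carry the function.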

Keywords: DNA-binding proteins; RNA-binding proteins; decision trees; machine learning; multiple-instance learning; protein function annotation; protein sequence and structural analysis; semi-supervised learning.