A correlation coefficient-based feature selection approach for virus-host protein-protein interaction prediction

Ahmed Hassan Ibrahim; Onur Can Karabulut; Betül Asiye Karpuzcu; Erdem Türk; Barış Ethem Süzek

doi:10.1371/journal.pone.0285168

PLoS One. 2023 May 2;18(5):e0285168. doi: 10.1371/journal.pone.0285168. eCollection 2023.

Authors

Ahmed Hassan Ibrahim¹, Onur Can Karabulut¹, Betül Asiye Karpuzcu¹, Erdem Türk¹,², Barış Ethem Süzek¹,²,³

Affiliations

¹ Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey.
² Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey.
³ Georgetown University Medical Center, Biochemistry and Molecular & Cellular Biology, Washington DC, United States of America.

Abstract

Prediction of virus-host protein-protein interactions (PPI) is a broad research area in which various machine-learning-based classifiers have been developed. Transforming biological data into machine-usable features is a preliminary step in constructing these virus-host PPI prediction tools. In this study, we adopted a virus-host PPI dataset and a reduced amino acid alphabet to create tripeptide features, and introduced a correlation coefficient-based feature selection. We applied feature selection using several correlation coefficient metrics and statistically tested their relevance in a structural context. We compared the performance of the feature-selection models against that of baseline virus-host PPI prediction models built with different classification algorithms and no feature selection. We also tested these baseline models against previously available tools to ensure that their predictive power is acceptable. Among the metrics, the Pearson coefficient came closest to the baseline model as measured by AUPR: for random forest, AUPR dropped by 0.003 while the number of tripeptide features fell by 73.3% (from 686 to 183). The results suggest that our correlation coefficient-based feature selection approach, while decreasing computation time and space complexity, has a limited impact on the prediction performance of virus-host PPI prediction tools.
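The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: the abstract does not name the reduced alphabet, so a seven-group (conjoint-triad style) grouping is assumed here because 7³ = 343 tripeptides per protein, concatenated for the virus and host proteins, gives the 686 features mentioned. The selection rule (keep features whose absolute Pearson correlation with the interaction label meets a threshold) is likewise one common variant and may differ from the criterion used in the paper. The names `GROUPS`, `tripeptide_features`, `pair_features`, and `select_by_pearson` are hypothetical.

```python
from itertools import product

import numpy as np

# ASSUMPTION: a 7-group reduced amino acid alphabet (conjoint-triad style).
# The abstract does not state which reduced alphabet was used.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}

# All 7**3 = 343 tripeptides over the reduced alphabet, with a lookup index.
TRIPEPTIDES = ["".join(p) for p in product("0123456", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}


def tripeptide_features(seq):
    """Relative frequency of each reduced-alphabet tripeptide in one sequence."""
    # Residues outside the 20 standard amino acids are skipped.
    reduced = [AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP]
    counts = np.zeros(len(TRIPEPTIDES))
    for i in range(len(reduced) - 2):
        counts[INDEX["".join(reduced[i:i + 3])]] += 1
    total = counts.sum()
    return counts / total if total else counts


def pair_features(virus_seq, host_seq):
    """Concatenate virus and host tripeptide vectors (343 + 343 = 686)."""
    return np.concatenate(
        [tripeptide_features(virus_seq), tripeptide_features(host_seq)]
    )


def select_by_pearson(X, y, threshold):
    """Return indices of columns of X with |Pearson r| against y >= threshold.

    ASSUMPTION: correlation is measured between each feature and the label;
    constant features are assigned r = 0 and therefore dropped.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    numerator = (Xc * yc[:, None]).sum(axis=0)
    denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    safe = np.where(denominator > 0, denominator, 1.0)
    r = np.where(denominator > 0, numerator / safe, 0.0)
    return np.flatnonzero(np.abs(r) >= threshold)
```

With a feature matrix `X` built by stacking `pair_features(...)` rows and a binary label vector `y`, `X[:, select_by_pearson(X, y, threshold)]` yields the reduced matrix to pass to a classifier such as a random forest. The threshold is a free parameter in this sketch. For reference, the reported reduction works out as (686 − 183) / 686 ≈ 0.733.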

Copyright: © 2023 Ibrahim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Machine Learning
Random Forest*

Grants and funding

This project is supported by The Scientific and Technological Research Council of Turkey (https://www.tubitak.gov.tr/en) under grant number 119E664 awarded to BES. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.