PREHOST: Host prediction of Coronaviridae family using machine learning

Heliyon. 2023 Feb;9(2):e13646. doi: 10.1016/j.heliyon.2023.e13646. Epub 2023 Feb 11.

Abstract

Coronavirus, a zoonotic virus capable of transmitting infections from animals to humans, recently caused a pandemic. In such circumstances, it is essential to understand the virus's origin. In this study, we present PreHost, a novel machine-learning pipeline for host prediction in the family Coronaviridae. We leverage the complete viral genome as well as protein-level sequences (spike, membrane, and nucleocapsid proteins). Compared with current state-of-the-art approaches, the random forest model attained high accuracy and recall scores of 99.91% and 0.98, respectively, for genome sequences. In addition to spike protein sequences, our study shows that membrane and nucleocapsid protein sequences can be used to predict the host of viruses. We also identified important sites in the viral sequences that help distinguish between different host classes. The host prediction pipeline PreHost will serve as a valuable tool for taking effective measures to control the transmission of future viruses.

Keywords: Biological sequences; Coronaviridae; Feature identification; Host specificity; Random forest; SARS-CoV-2; Zoonosis.
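
The abstract does not state how sequences are encoded before they reach the random forest model. As an illustration only, the following minimal sketch assumes a k-mer frequency encoding, a common choice for sequence classification; the function name `kmer_features`, the default `k=3`, and the nucleotide alphabet are hypothetical and are not taken from the PreHost pipeline.

```python
from collections import Counter
from itertools import product


def kmer_features(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies as a fixed-length list.

    Hypothetical helper for illustration; not the PreHost encoding.
    k-mers containing symbols outside `alphabet` (e.g. 'N') are ignored.
    """
    seq = sequence.upper()
    # Fixed vocabulary so every sequence maps to a vector of the same length.
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize by the number of windows that fall inside the vocabulary.
    total = sum(counts[m] for m in vocab)
    return [counts[m] / total if total else 0.0 for m in vocab]


if __name__ == "__main__":
    # Toy fragment, not real viral data.
    vector = kmer_features("ACGTACGTTGCA", k=2)
    print(len(vector))
```

Vectors of this kind, one per genome, could then be passed together with host labels to a classifier such as scikit-learn's `RandomForestClassifier`. For protein sequences, the same sketch would apply with `alphabet` set to the twenty amino-acid letters.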