Identification of Geographic Specific SARS-CoV-2 Mutations by Random Forest Classification and Variable Selection Methods

Stat Appl. 2020 Jul;18(1):253-268. Epub 2020 Jun 30.

Abstract

RNA viral genomes have very high mutation rates. As infection spreads through host populations, different viral lineages emerge and acquire independent mutations, which can lead to varied infection and death rates in different parts of the world. Using Random Forest classification and feature selection methods, we developed an analysis pipeline that identifies geographic-specific mutations and classifies viral lineages, focusing on missense variants that alter the function of the encoded proteins. We applied the pipeline to publicly available SARS-CoV-2 datasets and demonstrated that it accurately identified country- or region-specific viral lineages and the specific mutations that discriminate between lineages. The results presented here can help in designing country-specific diagnostic strategies and in prioritizing mutations for functional interpretation and experimental validation.
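As an illustration of the general idea only, the following is a minimal sketch of Random Forest classification with importance-based feature ranking on a binary mutation matrix. It is not the authors' pipeline: the toy data, the region labels (region_A, region_B, region_C), the marker sites, and all function names and parameters are hypothetical, and the forest is a simplified pure-Python version (bootstrap samples, a random feature subset per split, Gini impurity) rather than any particular library's implementation.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def build_tree(X, y, rows, n_try, rng, importance, depth=0, max_depth=5):
    """Grow one tree on binary features, adding each split's weighted
    impurity decrease to importance[feature]."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    parent = gini(labels)
    if parent == 0.0 or depth >= max_depth:
        return majority
    best = None
    # Consider only a random subset of features at each split.
    for f in rng.sample(range(len(X[0])), n_try):
        left = [i for i in rows if X[i][f] == 0]
        right = [i for i in rows if X[i][f] == 1]
        if not left or not right:
            continue
        child = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right])) / len(rows)
        if best is None or parent - child > best[0]:
            best = (parent - child, f, left, right)
    if best is None or best[0] <= 0:
        return majority
    gain, f, left, right = best
    importance[f] += gain * len(rows)
    return (f,
            build_tree(X, y, left, n_try, rng, importance, depth + 1, max_depth),
            build_tree(X, y, right, n_try, rng, importance, depth + 1, max_depth))

def predict_tree(node, x):
    """Follow splits until a leaf (a class label) is reached."""
    while isinstance(node, tuple):
        node = node[2] if x[node[0]] == 1 else node[1]
    return node

def fit_forest(X, y, n_trees=50, seed=0):
    """Fit trees on bootstrap samples; return trees and normalized importances."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_try = max(1, int(p ** 0.5))
    importance = [0.0] * p
    trees = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(build_tree(X, y, rows, n_try, rng, importance))
    total = sum(importance) or 1.0
    return trees, [v / total for v in importance]

def predict_forest(trees, x):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, x) for t in trees).most_common(1)[0][0]

def make_toy_data(n_per_region=60, n_sites=12, seed=1):
    """Hypothetical data: each region has one marker site (0, 1, or 2) that is
    mutated in about 95% of its genomes; other sites mutate at about 5%."""
    rng = random.Random(seed)
    marker = {"region_A": 0, "region_B": 1, "region_C": 2}
    X, y = [], []
    for region, site in marker.items():
        for _ in range(n_per_region):
            row = [1 if rng.random() < 0.05 else 0 for _ in range(n_sites)]
            row[site] = 1 if rng.random() < 0.95 else 0
            X.append(row)
            y.append(region)
    return X, y

X, y = make_toy_data()
trees, importance = fit_forest(X, y)
# Rank sites by importance; the top ones are candidate region-specific mutations.
top_sites = sorted(range(len(importance)), key=lambda f: importance[f], reverse=True)[:3]
accuracy = sum(predict_forest(trees, x) == label for x, label in zip(X, y)) / len(y)
print("top sites:", top_sites, "training accuracy:", round(accuracy, 3))
```

In this sketch, sites with the highest importance scores are the candidates for region-discriminating mutations. A real analysis would start from called variants in sequenced genomes, evaluate accuracy on held-out data rather than the training set, and could substitute an established Random Forest implementation and a dedicated variable selection procedure.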

Keywords: Classification; Coronavirus; Feature selection; Random forest; SARS-CoV-2.