Small-Sample Estimation of the Mutational Support and Distribution of SARS-CoV-2

IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):668-682. doi: 10.1109/TCBB.2022.3165395. Epub 2023 Feb 3.

Abstract

We consider the problem of determining the mutational support and distribution of the SARS-CoV-2 viral genome in the small-sample regime. The mutational support refers to the unknown number of sites that may eventually mutate in the SARS-CoV-2 genome, while the mutational distribution refers to the distribution of point mutations in the viral genome across a population. The mutational support may be used to assess the virulence of the virus and guide primer selection for real-time RT-PCR testing. Estimating the distribution of mutations in the genome of different subpopulations while accounting for the unseen may also aid in discovering new variants. To estimate the mutational support in the small-sample regime, we use GISAID sequencing data and our state-of-the-art polynomial estimation techniques based on new weighted and regularized Chebyshev approximation methods. For distribution estimation, we adapt the well-known Good-Turing estimator. Our analysis reveals several findings. First, the mutational supports exhibit significant differences in the ORF6 and ORF7a regions (older versus younger patients), in the ORF1b and ORF10 regions (females versus males), and in almost all ORFs (Asia versus Europe versus North America). Second, even though the N region of SARS-CoV-2 has a predicted 10% mutational support, the mutations fall outside of the primer regions recommended by the CDC.
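
For readers unfamiliar with the Good-Turing estimator mentioned above, the following is a minimal sketch of its classic textbook form, not the adapted version used in the paper. It estimates the total probability of unseen symbols as N1/N (singletons over sample size) and adjusts observed counts via r* = (r+1) N_{r+1} / N_r. The function names, the fallback to the raw count when N_{r+1} = 0, and the toy site indices are all assumptions made for illustration.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Classic Good-Turing estimate of the probability mass of unseen symbols.

    Returns N1 / N, where N1 is the number of symbols seen exactly once
    and N is the total number of observations.
    """
    counts = Counter(observations)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one observation")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def good_turing_adjusted_counts(observations):
    """Adjusted counts r* = (r + 1) * N_{r+1} / N_r for each observed symbol.

    Assumption for this sketch: when N_{r+1} is zero, the raw count r is
    kept unchanged (practical implementations usually smooth N_r instead).
    """
    counts = Counter(observations)
    freq_of_freq = Counter(counts.values())  # N_r: number of symbols seen r times
    adjusted = {}
    for symbol, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if n_r_plus_1 > 0:
            adjusted[symbol] = (r + 1) * n_r_plus_1 / n_r
        else:
            adjusted[symbol] = float(r)
    return adjusted


# Hypothetical toy sample: each entry is the index of a site observed to mutate.
toy_sites = [101, 101, 101, 202, 202, 303, 404, 505]
missing_mass = good_turing_missing_mass(toy_sites)
adjusted = good_turing_adjusted_counts(toy_sites)
```

In the toy sample, three of the eight observations are singletons, so the estimated probability that the next observed mutation falls at a previously unseen site is 3/8.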

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • COVID-19* / genetics
  • Female
  • Genome, Viral / genetics
  • Humans
  • Male
  • Mutation / genetics
  • Point Mutation
  • SARS-CoV-2* / genetics

Grants and funding

This work was supported by the NSF CCBGM Center at the University of Illinois and by the NSF under Grant CIF2107344.