Machine Learning Methods for Predicting Human-Adaptive Influenza A Viruses Based on Viral Nucleotide Compositions

Mol Biol Evol. 2020 Apr 1;37(4):1224-1236. doi: 10.1093/molbev/msz276.

Abstract

Each influenza pandemic was caused at least partly by avian- and/or swine-origin influenza A viruses (IAVs). The timing of and the potential IAVs involved in the next pandemic are currently unpredictable. We aim to build machine learning (ML) models to predict human-adaptive IAV nucleotide composition. A total of 217,549 IAV full-length coding sequences of the PB2 (polymerase basic protein-2), PB1, PA (polymerase acidic protein), HA (hemagglutinin), NP (nucleoprotein), and NA (neuraminidase) segments were decomposed for their codon position-based mononucleotides (12 nts) and dinucleotides (48 dnts). A total of 68,742 human sequences and 68,739 avian sequences (1:1) were resampled to characterize the human adaptation-associated (d)nts with principal component analysis (PCA) and other ML models. Then, the human adaptation of IAV sequences was predicted based on the characterized (d)nts. Respectively, 9, 12, 11, 13, 10 and 9 human-adaptive (d)nts were optimized for the six segments. PCA and hierarchical clustering analysis revealed the linear separability of the optimized (d)nts between the human-adaptive and avian-adaptive sets. The results of the confusion matrix and the area under the receiver operating characteristic curve indicated a high performance of the ML models to predict human adaptation of IAVs. Our model performed well in predicting the human adaptation of the swine/avian IAVs before and after the 2009 H1N1 pandemic. In conclusion, we identified the human adaptation-associated genomic composition of IAV segments. ML models for IAV human adaptation prediction using large IAV genomic data sets can facilitate the identification of key viral factors that affect virus transmission/pathogenicity. Most importantly, it allows the prediction of pandemic influenza.

Keywords: dinucleotide; genomic nucleotide composition; human adaptation; influenza A viruses (IAVs); machine learning (ML).

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adaptation, Biological / genetics*
  • Host-Pathogen Interactions
  • Humans
  • Influenza A virus / genetics*
  • Machine Learning*
  • Viral Proteins / genetics*

Substances

  • Viral Proteins