Exploring the predictive capability of advanced machine learning in identifying severe disease phenotype in Salmonella enterica

Food Res Int. 2022 Jan:151:110817. doi: 10.1016/j.foodres.2021.110817. Epub 2021 Nov 22.

Abstract

The past few years have seen a significant increase in availability of whole genome sequencing information, allowing for its incorporation in predictive modeling for foodborne pathogens to account for inter- and intra-species differences in their virulence. However, this is hindered by the inability of traditional statistical methods to analyze such large amounts of data compared to the number of observations/isolates. In this study, we have explored the applicability of machine learning (ML) models to predict the disease outcome, while identifying features that exert a significant effect on the prediction. This study was conducted on Salmonella enterica, a major foodborne pathogen with considerable inter- and intra-serovar variation. WGS of isolates obtained from various sources (i.e., human, chicken, and swine) were used as input in four machine learning models (logistic regression with ridge, random forest, support vector machine, and AdaBoost) to classify isolates based on disease severity (extraintestinal vs. gastrointestinal) in the host. The predictive performances of all models were tested with and without Elastic Net regularization to combat dimensionality issues. Elastic Net-regularized logistic regression model showed the best area under the receiver operating characteristic curve (AUC-ROC; 0.86) and outcome prediction accuracy (0.76). Additionally, genes coding for transcriptional regulation, acidic, oxidative, and anaerobic stress response, and antibiotic resistance were found to be significant predictors of disease severity. These genes, which were significantly associated with each outcome, could possibly be input in amended, gene-expression-specific predictive models to estimate virulence pattern-specific effect of Salmonella and other foodborne pathogens on human health.

Keywords: Disease phenotypes; Disease severity; Food safety; Machine learning; Predictive modeling; Salmonella.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Animals
  • Machine Learning
  • Phenotype
  • Salmonella / genetics
  • Salmonella enterica* / genetics
  • Swine
  • Whole Genome Sequencing