Improving hazard characterization in microbial risk assessment using next generation sequencing data and machine learning: Predicting clinical outcomes in shigatoxigenic Escherichia coli

Patrick Murigu Kamau Njage; Pimlapas Leekitcharoenphon; Tine Hald

doi:10.1016/j.ijfoodmicro.2018.11.016

Int J Food Microbiol. 2019 Mar 2:292:72-82. doi: 10.1016/j.ijfoodmicro.2018.11.016. Epub 2018 Dec 4.

Authors

Patrick Murigu Kamau Njage¹, Pimlapas Leekitcharoenphon², Tine Hald²

Affiliations

¹ Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kemitorvet, Building 204, 2800 Kgs. Lyngby, Denmark. Electronic address: panj@food.dtu.dk.
² Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kemitorvet, Building 204, 2800 Kgs. Lyngby, Denmark.

PMID: 30579059

Abstract

The ever-decreasing cost and increasing throughput of next generation sequencing (NGS) techniques have resulted in a rapid increase in the availability of NGS data. Such data have the potential for rapid, reproducible and highly discriminative characterization of pathogens. This provides an opportunity in microbial risk assessment to account for variations in survivability and virulence among strains. A major challenge for such attempts remains the high dimensionality of genomic data relative to the number of isolates. Machine learning-based (ML) predictive risk modelling provides a solution to this "curse of dimensionality" while accounting for individual effects that depend on interactions with other genetic and environmental factors. This pilot study explores the potential of ML in the prediction of health endpoints resulting from shigatoxigenic E. coli (STEC) infection. Amino acid sequences of accessory genes were used as model input to predict and differentiate health outcomes in STEC infections, including diarrhea, bloody diarrhea, hemolytic uremic syndrome (HUS) and their combinations. Outcome severity was also distinguished by hospitalization. A matrix of percent similarity between accessory genes and the E. coli genomes was generated and subsequently used as input for ML. The performance of the ML algorithms random forest, support vector machine (radial and linear kernel), gradient boosting, and logit boost was compared. Logit boost was the best model, with an outcome prediction accuracy of 0.75 (95% CI: 0.60, 0.86) and a substantial to excellent performance (Kappa = 0.72).
Important genetic predictors of riskier STEC clinical outcomes included proteins involved in initial attachment to the host cell, persistence of plasmids or genomic islands, conjugative plasmid transfer and formation of sex pili, regulation of locus of enterocyte effacement expression, post-translational acetylation of proteins, facilitation of the rearrangement or deletion of sections within the pathogenic islands, and transport of macromolecules across the cell envelope. Further studies are proposed on the proteins with undefined or unclear functionality. One protein family in particular predicted HUS outcome. Toxin-antitoxin systems are potential stress adaptation markers which may mediate environmental persistence of strains in diverse sources. We foresee the application of the ML approach to the set-up of real-time online analysis of whole genome sequence data to estimate the human health risk at the population or strain level. The ML approach is envisaged to support the prediction of more specific STEC clinical endpoint types by inputting isolate sequence data.
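
The abstract reports model performance as accuracy and Kappa. As a minimal sketch of how these two statistics relate, the Python below computes accuracy and Cohen's Kappa from true and predicted outcome labels. It is not the authors' pipeline: the function name `accuracy_and_kappa` and the eight toy labels (D = diarrhea, BD = bloody diarrhea, HUS = hemolytic uremic syndrome) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: fraction of isolates whose predicted outcome
    # matches the recorded outcome (this is the accuracy).
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Expected chance agreement, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    # Kappa is undefined when chance agreement is already perfect.
    if p_expected == 1.0:
        return p_observed, float("nan")

    kappa = (p_observed - p_expected) / (1.0 - p_expected)
    return p_observed, kappa


# Hypothetical toy labels, NOT data from the study.
y_true = ["D", "D", "BD", "BD", "HUS", "HUS", "D", "BD"]
y_pred = ["D", "D", "BD", "D", "HUS", "HUS", "D", "BD"]

acc, kappa = accuracy_and_kappa(y_true, y_pred)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")  # accuracy = 0.875, kappa = 0.810
```

Kappa discounts the agreement expected by chance given the class frequencies, which is why it is usually lower than accuracy when outcome classes are imbalanced.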

Keywords: Hazard characterization; Hazard identification; Infection outcome; Logit boost; Risk characterization; STEC; Whole genome sequencing.

MeSH terms

Adolescent
Adult
Aged
Child
Diarrhea / microbiology
Diarrhea / therapy*
Escherichia coli Infections / epidemiology
Escherichia coli Infections / therapy*
Escherichia coli Proteins / genetics
Genomics / methods
Hemolytic-Uremic Syndrome / microbiology
Hemolytic-Uremic Syndrome / therapy*
High-Throughput Nucleotide Sequencing
Humans
Machine Learning
Middle Aged
Models, Theoretical
Phylogeny
Pilot Projects
Plasmids / genetics
Risk Assessment / methods
Shiga-Toxigenic Escherichia coli / genetics*
Shiga-Toxigenic Escherichia coli / isolation & purification
Treatment Outcome
Virulence / genetics
Virulence Factors / genetics
Whole Genome Sequencing
Young Adult

Substances

Escherichia coli Proteins
Virulence Factors