Core Genome Allelic Profiles of Clinical Klebsiella pneumoniae Strains Using a Random Forest Algorithm Based on Multilocus Sequence Typing Scheme for Hypervirulence Analysis

J Infect Dis. 2020 Mar 16;221(Suppl 2):S263-S271. doi: 10.1093/infdis/jiz562.

Abstract

Background: Hypervirulent Klebsiella pneumoniae (hvKP) infections can have high morbidity and mortality rates owing to their invasiveness and virulence. However, there are no effective tools or biomarkers to discriminate between hvKP and nonhypervirulent K. pneumoniae (nhvKP) strains. We aimed to use a random forest algorithm to predict hvKP based on core-genome data.

Methods: In total, 272 K. pneumoniae strains were collected from 20 tertiary hospitals in China and divided into hvKP and nhvKP groups according to clinical criteria. Clinical data comparisons, whole-genome sequencing, virulence profile analysis, and core genome multilocus sequence typing (cgMLST) were performed. We then established a random forest predictive model based on the cgMLST scheme to prospectively identify hvKP. Random forest is an ensemble learning method that builds multiple decision trees during training; each tree outputs its own prediction for a given input, and the forest combines these predictions. The predictive ability of the model was assessed by the area under the receiver operating characteristic curve.
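
As a rough illustration of the ensemble idea described in the Methods, the following Python sketch trains a toy random-forest-style classifier on made-up allelic profiles and scores it with the area under the receiver operating characteristic curve. It is not the authors' implementation: each tree is reduced to a single split (a decision stump), and the function names (train_stump, train_forest, predict_score, auc), the toy profiles, and the parameter values are all assumptions made for this example.

```python
import random
from collections import Counter

def train_stump(rows, labels, locus_ids):
    """Fit a one-split tree: pick the locus whose per-allele majority label fits best."""
    best = None
    for locus in locus_ids:
        by_allele = {}
        for row, y in zip(rows, labels):
            by_allele.setdefault(row[locus], Counter())[y] += 1
        rule = {a: c.most_common(1)[0][0] for a, c in by_allele.items()}
        correct = sum(c[rule[a]] for a, c in by_allele.items())
        if best is None or correct > best[0]:
            best = (correct, locus, rule)
    default = Counter(labels).most_common(1)[0][0]  # used for unseen alleles
    return best[1], best[2], default

def train_forest(rows, labels, n_trees=25, seed=0):
    """Each tree sees a bootstrap sample of strains and a random subset of loci."""
    rng = random.Random(seed)
    n, n_loci = len(rows), len(rows[0])
    k = max(1, int(n_loci ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        loci = rng.sample(range(n_loci), k)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], loci))
    return forest

def predict_score(forest, row):
    """Fraction of trees voting for class 1 (here: hvKP)."""
    votes = sum(rule.get(row[locus], default) for locus, rule, default in forest)
    return votes / len(forest)

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up profiles: one allele number per locus; 1 = hvKP, 0 = nhvKP.
toy_profiles = [
    [1, 3, 5, 7], [1, 3, 5, 7], [1, 3, 5, 7],
    [2, 4, 5, 7], [2, 4, 5, 7], [2, 4, 5, 7],
]
toy_labels = [1, 1, 1, 0, 0, 0]

forest = train_forest(toy_profiles, toy_labels)
scores = [predict_score(forest, row) for row in toy_profiles]
print("training AUC:", auc(scores, toy_labels))
```

A full random forest grows deeper trees and resamples candidate features at every split; the stump version above keeps only the two ingredients the abstract names, namely many trees and one prediction per tree, so that the voting step is easy to follow.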

Results: Patients in the hvKP group were younger than those in the nhvKP group (median age, 58.0 and 68.0 years, respectively; P < .001). More patients in the hvKP group had underlying diabetes mellitus (43.1% vs 20.1%; P < .001). Clinically, carbapenem-resistant K. pneumoniae was less common in the hvKP group (4.1% vs 63.8%; P < .001), whereas the K1/K2 serotype, sequence type (ST) 23, and positive string test results were significantly more common in the hvKP group. A cgMLST-based minimal spanning tree revealed that hvKP strains were scattered sporadically within nhvKP clusters. ST23 showed greater genome diversification than did ST11, according to cgMLST-based allelic differences. Primary virulence factors (rmpA, iucA, positive string test result, and the presence of virulence plasmid pLVPK) were poor predictors of the hypervirulence phenotype. The random forest model based on the core genome allelic profile showed excellent predictive power in both the training and validation sets (area under the receiver operating characteristic curve, 0.987 and 0.999, respectively).
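
The cgMLST-based allelic differences mentioned above can be pictured as a count of loci at which two profiles carry different alleles. The Python sketch below shows one common way to compute such a distance and to average it within a group of strains. It is an illustration only: the function names, the convention that 0 marks a missing locus, and the rule of skipping missing loci are assumptions for this example and are not taken from the study.

```python
from itertools import combinations

def allelic_distance(profile_a, profile_b, missing=0):
    """Count loci with different alleles, skipping loci missing in either profile."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a != missing and b != missing and a != b
    )

def mean_pairwise_distance(profiles, missing=0):
    """Average allelic distance over all pairs of strains in a group."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two profiles")
    return sum(allelic_distance(a, b, missing) for a, b in pairs) / len(pairs)

# Made-up profiles (allele number per locus; 0 = locus not called).
strain_x = [1, 2, 3, 4]
strain_y = [1, 5, 0, 6]
print(allelic_distance(strain_x, strain_y))
```

Under these assumptions, a group with a larger mean pairwise distance would be read as more diversified, which is the kind of comparison the abstract reports for ST23 versus ST11.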

Conclusions: A random forest predictive model based on the core genome allelic profiles of K. pneumoniae accurately identified hypervirulent isolates.

Keywords: hypervirulent Klebsiella pneumoniae; liver abscess; multilocus sequence typing; predict; random forest.

Publication types

  • Multicenter Study
  • Observational Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Aged
  • Algorithms*
  • Bacterial Proteins / genetics
  • China
  • Female
  • Humans
  • Klebsiella Infections / microbiology*
  • Klebsiella pneumoniae / genetics*
  • Klebsiella pneumoniae / pathogenicity*
  • Male
  • Middle Aged
  • Multilocus Sequence Typing
  • Phenotype
  • Plasmids / genetics
  • Prospective Studies
  • Serogroup
  • Virulence / genetics
  • Virulence Factors / genetics*
  • Whole Genome Sequencing

Substances

  • Bacterial Proteins
  • Virulence Factors