Evaluation of stacked ensemble model performance to predict clinical outcomes: A COVID-19 study

Int J Med Inform. 2023 Jul:175:105090. doi: 10.1016/j.ijmedinf.2023.105090. Epub 2023 May 8.

Abstract

Background: The application of machine learning (ML) to clinical data with the goal of predicting patient outcomes has garnered increasing attention. Ensemble learning has been used in conjunction with ML to improve predictive performance. Although stacked generalization (stacking), a type of heterogeneous ensemble of ML models, has emerged in clinical data analysis, it remains unclear how to define the best model combinations for strong predictive performance. This study develops a methodology to evaluate the performance of "base" learner models and their optimized combination using "meta" learner models in stacked ensembles, so that performance can be assessed accurately in the context of clinical outcomes.
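To make the base/meta distinction concrete, the following is a minimal sketch of stacking written out by hand in Python with scikit-learn. It is not the study's pipeline: the data are synthetic (`make_classification`), and the two base learners and the logistic-regression meta learner are arbitrary choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a clinical table (assumption: not the study data).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Level 0: heterogeneous base learners (illustrative choices).
base_learners = [DecisionTreeClassifier(max_depth=4, random_state=0),
                 GaussianNB()]

# Out-of-fold probabilities: each training row is scored by a model that
# did not see that row, so the meta learner is not fit on leaked predictions.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# Refit each base learner on the full training set to score held-out rows.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for m in base_learners
])

# Level 1: the meta learner combines the base learners' probabilities.
meta_learner = LogisticRegression()
meta_learner.fit(train_meta, y_train)
stacked_auroc = roc_auc_score(
    y_test, meta_learner.predict_proba(test_meta)[:, 1])
print(f"Stacked AUROC on held-out data: {stacked_auroc:.3f}")
```

The meta learner's input has one column per base learner, which is why the choice and number of base learners directly shape what the meta learner can learn.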

Methods: De-identified COVID-19 data were obtained from the University of Louisville Hospital, where a retrospective chart review was performed from March 2020 to November 2021. Three differently sized feature subsets drawn from the overall dataset were used to train and evaluate ensemble classification performance. The number of base learners, chosen from several algorithm families and coupled with a complementary meta learner, was varied from a minimum of 2 to a maximum of 8. Predictive performance of these combinations was evaluated for the mortality and severe cardiac event outcomes using area under the receiver operating characteristic curve (AUROC), F1, balanced accuracy, and kappa.
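The evaluation described above can be sketched roughly as follows, using scikit-learn's `StackingClassifier`. Several things here are assumptions rather than details from the study: the data are synthetic, the pool of eight candidate base learners and their order are arbitrary, taking the first n learners from the pool stands in for the study's selection across algorithm families, logistic regression stands in for a GLM meta learner, and kappa is taken to mean Cohen's kappa.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary outcome (assumption: not the study data).
X, y = make_classification(n_samples=500, n_features=15, n_informative=6,
                           weights=[0.8, 0.2], random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Hypothetical pool of base learners from different algorithm families.
candidate_pool = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=1)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
    ("lda", LinearDiscriminantAnalysis()),
    ("svm", SVC(probability=True, random_state=1)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=1)),
]


def evaluate_stack(n_base):
    """Fit a stack on the first n_base learners and score it on validation."""
    stack = StackingClassifier(
        estimators=candidate_pool[:n_base],
        final_estimator=LogisticRegression(max_iter=1000),  # GLM stand-in
        cv=5,
        stack_method="predict_proba",
    )
    stack.fit(X_tr, y_tr)
    prob = stack.predict_proba(X_va)[:, 1]
    pred = stack.predict(X_va)
    return {
        "auroc": roc_auc_score(y_va, prob),
        "f1": f1_score(y_va, pred),
        "balanced_accuracy": balanced_accuracy_score(y_va, pred),
        "kappa": cohen_kappa_score(y_va, pred),
    }


# Vary the number of base learners from 2 to 8, as in the Methods.
results = {n: evaluate_stack(n) for n in range(2, 9)}
for n, scores in results.items():
    print(n, {name: round(value, 3) for name, value in scores.items()})
```

Swapping `final_estimator` for another classifier (for example `KNeighborsClassifier`) would give a rough way to compare meta learners on the same base-learner sets.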

Results: The results highlight the potential to accurately predict clinical outcomes, such as severe cardiac events with COVID-19, from routinely acquired in-hospital patient data. The meta learners Generalized Linear Model (GLM), Multi-Layer Perceptron (MLP), and Partial Least Squares (PLS) had the highest AUROC for both outcomes, while K-Nearest Neighbors (KNN) had the lowest. Performance trended lower in the training set as the number of features increased, and exhibited less variance in both training and validation across all feature subsets as the number of base learners increased.

Conclusion: This study offers a methodology to robustly evaluate ensemble ML performance when analyzing clinical data.

Keywords: COVID-19; Clinical data analysis; Machine learning; Meta learners; Stacked ensemble; Stacked generalization.

MeSH terms

  • Algorithms
  • COVID-19*
  • Humans
  • Machine Learning
  • Neural Networks, Computer
  • Retrospective Studies