An explainable machine learning framework for lung cancer hospital length of stay prediction

Belal Alsinglawi; Osama Alshari; Mohammed Alorjani; Omar Mubin; Fady Alnajjar; Mauricio Novoa; Omar Darwish

doi:10.1038/s41598-021-04608-7

Sci Rep. 2022 Jan 12;12(1):607. doi: 10.1038/s41598-021-04608-7.

Authors

Belal Alsinglawi¹, Osama Alshari², Mohammed Alorjani³, Omar Mubin¹, Fady Alnajjar⁴, Mauricio Novoa⁵, Omar Darwish⁶

Affiliations

¹ School of Computer, Data and Mathematical Sciences, Western Sydney University, Rydalmere, 2116, NSW, Australia.
² Oncology Division, Department of Internal Medicine, Faculty of Medicine, Jordan University of Science and Technology, Irbid, Jordan.
³ Department of Pathology and Microbiology, Faculty of Medicine, Jordan University of Science and Technology, Irbid, Jordan.
⁴ College of Information Technology, UAE University, Al-Ain, UAE. fady.alnajjar@uaeu.ac.ae.
⁵ The School of Engineering, Design and Built Environment, Western Sydney University, Rydalmere, 2116, NSW, Australia.
⁶ Department of Information Security and Applied Computing, Eastern Michigan University, Michigan, 48197, USA.

Abstract

This work introduces a predictive Length of Stay (LOS) framework for lung cancer patients using machine learning (ML) models. The framework is designed to deal with imbalanced datasets in classification-based approaches that use electronic healthcare records (EHR). We used supervised ML methods to predict lung cancer inpatients' LOS during ICU hospitalization using the MIMIC-III dataset. The Random Forest (RF) model outperformed the other models across the three framework phases. With feature selection based on clinical significance, over-sampling methods (SMOTE and ADASYN) achieved the highest AUC results (98%, 95% CI: 95.3-100%, and 100%, respectively). Combined over-sampling and under-sampling methods achieved the second-highest AUC results (98%, 95% CI: 95.3-100% for SMOTE-Tomek, and 97%, 95% CI: 93.7-100% for SMOTE-ENN). Under-sampling methods (ENN and Tomek Links) both reported the lowest AUC results (50%, 95% CI: 40.2-59.8%). Using an explainable ML technique called SHAP, we explained the outcome of the RF predictive model with the SMOTE class-balancing technique to understand which clinical features contributed most to predicting lung cancer LOS. Our promising framework allows ML techniques to be employed in hospital clinical information systems to predict lung cancer admissions into the ICU.
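The abstract names SMOTE as a class-balancing method but does not describe how it works. The following is a minimal pure-Python sketch of the interpolation idea behind SMOTE, not the authors' implementation. The function name `smote`, its parameters, and the toy two-feature data are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data (assumption): two features per patient,
# with the minority class standing in for the rarer LOS category.
majority = [[float(i), float(i % 3)] for i in range(20)]
minority = [[30.0, 5.0], [31.0, 6.0], [32.0, 5.5], [33.0, 6.5]]

new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. In practice a maintained library implementation would normally be preferred over a hand-written one, and over-sampling is usually applied to the training split only.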

Publication types

Comparative Study
Evaluation Study

MeSH terms

Humans
Length of Stay*
Lung Neoplasms*
Machine Learning*