Development and validation of machine learning models for nonalcoholic fatty liver disease

Hepatobiliary Pancreat Dis Int. 2023 Dec;22(6):615-621. doi: 10.1016/j.hbpd.2023.03.009. Epub 2023 Mar 25.

Abstract

Background: Nonalcoholic fatty liver disease (NAFLD) has become the most prevalent liver disease worldwide. Early diagnosis can effectively reduce NAFLD-related morbidity and mortality. This study aimed to combine risk factors to develop and validate a novel model for predicting NAFLD.

Methods: We enrolled 578 participants who had completed abdominal ultrasound into the training set. Least absolute shrinkage and selection operator (LASSO) regression combined with random forest (RF) was used to screen significant predictors of NAFLD risk. Five machine learning models were developed: logistic regression (LR), RF, extreme gradient boosting (XGBoost), gradient boosting machine (GBM), and support vector machine (SVM). To further improve model performance, we conducted hyperparameter tuning with the train function in the Python package 'sklearn'. We included 131 participants who had completed magnetic resonance imaging in the testing set for external validation.
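
The screening and tuning steps described in the Methods can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the data are synthetic (generated with make_classification), the rule for combining LASSO and RF (intersecting the LASSO-retained features with the eight most important RF features) is an assumption, GradientBoostingClassifier stands in for XGBoost to avoid an extra dependency, and GridSearchCV is assumed as the tuning routine because the abstract does not detail the procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 578-participant training set (NOT the study data).
X, y = make_classification(
    n_samples=578, n_features=20, n_informative=6, random_state=0
)

# Step 1a: L1-penalised (LASSO-type) logistic regression.
# Features whose coefficient is shrunk to exactly zero are dropped.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        penalty="l1", solver="liblinear", Cs=10, cv=5, random_state=0
    ),
)
lasso.fit(X, y)
lasso_keep = np.flatnonzero(lasso[-1].coef_.ravel() != 0)

# Step 1b: random forest importance ranking; keep the top 8 (assumed cut-off).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_top = np.argsort(rf.feature_importances_)[::-1][:8]

# Assumed combination rule: features retained by both methods.
selected = sorted(set(lasso_keep.tolist()) & set(rf_top.tolist()))
if not selected:  # fall back to the RF ranking if the intersection is empty
    selected = sorted(rf_top.tolist())

# Step 2: cross-validated grid search over a small, hypothetical grid,
# scored by AUC. GradientBoostingClassifier is a stand-in for XGBoost.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1],
    },
    scoring="roc_auc",
    cv=5,
)
grid.fit(X[:, selected], y)

print("selected feature indices:", selected)
print("best parameters:", grid.best_params_)
print("cross-validated AUC:", round(grid.best_score_, 3))
```

The cross-validated AUC printed here is an internal estimate on synthetic data; in the study design, the final check is external validation on the separate MRI-assessed testing set.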

Results: In the training set, 329 participants had NAFLD and 249 did not; in the testing set, 96 had NAFLD and 35 did not. Visceral adiposity index, abdominal circumference, body mass index, alanine aminotransferase (ALT), ALT/AST (aspartate aminotransferase), age, high-density lipoprotein cholesterol (HDL-C) and elevated triglyceride (TG) were important predictors of NAFLD risk. The areas under the curve (AUCs) of LR, RF, XGBoost, GBM, and SVM were 0.915 [95% confidence interval (CI): 0.886-0.937], 0.907 (95% CI: 0.856-0.938), 0.928 (95% CI: 0.873-0.944), 0.924 (95% CI: 0.875-0.939), and 0.900 (95% CI: 0.883-0.913), respectively. The XGBoost model showed the best predictive performance, and its AUC increased to 0.938 (95% CI: 0.870-0.950) with further parameter tuning.
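
For readers unfamiliar with how an AUC and its 95% CI can be obtained, a minimal sketch follows. The abstract does not state how the intervals were computed; a percentile bootstrap is assumed here. The scores are simulated, with only the class sizes (96 with NAFLD, 35 without) taken from the testing set, and the helper names auc and bootstrap_ci are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated risk scores (NOT study data): 96 cases and 35 controls.
rng = random.Random(42)
labels = [1] * 96 + [0] * 35
scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]

point = auc(labels, scores)
lower, upper = bootstrap_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```

With only 35 participants without NAFLD in the testing set, intervals of this kind tend to be wide, which is worth keeping in mind when comparing models whose AUCs differ by a few hundredths.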

Conclusions: This study developed and validated five novel machine learning models for NAFLD prediction, among which XGBoost presented the best performance and was considered a reliable reference for early identification of high-risk patients with NAFLD in clinical practice.

Keywords: Machine learning; Nonalcoholic fatty liver disease; Predictive factors.

MeSH terms

  • Alanine Transaminase
  • Area Under Curve
  • Humans
  • Machine Learning
  • Non-alcoholic Fatty Liver Disease* / diagnostic imaging
  • Risk Factors

Substances

  • Alanine Transaminase