Predictive model and risk analysis for peripheral vascular disease in type 2 diabetes mellitus patients using machine learning and shapley additive explanation

Front Endocrinol (Lausanne). 2024 Feb 28;15:1320335. doi: 10.3389/fendo.2024.1320335. eCollection 2024.

Abstract

Background: Peripheral vascular disease (PVD) is a common complication in patients with type 2 diabetes mellitus (T2DM). Early detection of PVD, or prediction of the risk of developing it, is important for clinical decision-making.

Purpose: This study aims to establish and validate PVD risk prediction models and perform risk factor analysis for PVD in patients with T2DM using machine learning and Shapley Additive Explanation (SHAP) based on electronic health records.

Methods: We retrospectively analyzed data from 4,372 inpatients with diabetes at a hospital between January 1, 2021, and March 28, 2023. The data comprised demographic characteristics, discharge diagnoses, and biochemical test results. After data preprocessing and feature selection using Recursive Feature Elimination (RFE), the dataset was split into training and testing sets at a ratio of 8:2, and the Synthetic Minority Over-sampling Technique (SMOTE) was applied to balance the training set. Six machine learning (ML) algorithms, namely decision tree (DT), logistic regression (LR), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), and Adaptive Boosting (AdaBoost), were applied to construct PVD prediction models. A grid search with 10-fold cross-validation was conducted to optimize the hyperparameters. Model performance was assessed using accuracy, precision, recall, F1-score, G-mean, and the area under the receiver operating characteristic curve (AUC). The SHAP method was used to interpret the best-performing model.
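As an illustration of the threshold-based metrics named in the Methods, the following minimal sketch derives them from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical and not taken from the study. G-mean is assumed here to be the geometric mean of sensitivity (recall) and specificity, which is the common definition; the abstract itself does not spell it out. AUC is omitted because it depends on ranked scores rather than on a single confusion matrix.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumed definition: geometric mean of sensitivity and specificity.
        "g_mean": math.sqrt(recall * specificity),
    }

# Made-up counts for illustration only; not data from the study.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, G-mean drops sharply when either class is predicted poorly, which is why it is often reported alongside accuracy on imbalanced data.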

Results: RFE identified the optimal 12 predictors. The XGBoost model outperformed the other five ML models, with an AUC of 0.945, G-mean of 0.843, accuracy of 0.890, precision of 0.930, recall of 0.927, and F1-score of 0.928. The feature importance of the ML models and the SHAP results indicated that hemoglobin (Hb), age, total bile acids (TBA), and lipoprotein(a) (LP-a) are the four most important risk factors for PVD in T2DM.
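The reported XGBoost figures can be cross-checked with simple arithmetic. The sketch below recomputes the F1-score from the reported precision and recall. Assuming G-mean is the geometric mean of recall and specificity (an assumption, since the abstract does not define it), it also backs out the specificity this would imply. The implied specificity is a derived approximation from rounded values, not a result reported by the authors.

```python
# Values reported in the abstract for the XGBoost model.
precision, recall, f1_reported, g_mean = 0.930, 0.927, 0.928, 0.843

# F1 is the harmonic mean of precision and recall.
f1_recomputed = 2 * precision * recall / (precision + recall)

# Assumption: g_mean = sqrt(recall * specificity), so
# specificity = g_mean**2 / recall. Derived, not reported.
implied_specificity = g_mean ** 2 / recall

print(f"F1 recomputed: {f1_recomputed:.3f} (reported {f1_reported})")
print(f"Specificity implied by G-mean: {implied_specificity:.2f}")
```

Under these assumptions the recomputed F1-score agrees with the reported 0.928, and the implied specificity is roughly 0.77, noticeably lower than the recall.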

Conclusion: The machine learning approach successfully developed a PVD risk prediction model with good performance. The model identified the factors associated with PVD and offered physicians an intuitive understanding of the impact of key features in the model.
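To make concrete the idea of attributing a prediction to individual features, the sketch below computes exact Shapley values for a single instance by enumerating every feature coalition. This is the brute-force textbook definition on a hypothetical three-feature toy score. It is not the study's XGBoost model, and it is not how the SHAP library computes values in practice, since enumeration grows exponentially with the number of features. The names `exact_shapley` and `toy_score` and the all-zero baseline are illustrative assumptions.

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one instance via coalition enumeration."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the instance's value;
        # all other features are held at the baseline value.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical toy score with an interaction between features a and c.
def toy_score(features):
    a, b, c = features
    return 2.0 * a + 1.0 * b + a * c

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_score, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for `x` and the prediction at the baseline, and the interaction term `a * c` is split evenly between the two features involved.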

Keywords: machine learning; peripheral vascular disease; predictive model; risk factor; shapley additive explanation; type 2 diabetes mellitus.

MeSH terms

  • Algorithms
  • Diabetes Mellitus, Type 2* / complications
  • Humans
  • Retrospective Studies
  • Risk Assessment
  • Risk Factors

Grants and funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This study was supported by the Natural Science Foundation of Hainan Province (Nos. 821QN0895 and 821MS044) and Research Foundation for Advanced Talents of Hainan (No. 820RC649).