Application of multi-label classification models for the diagnosis of diabetic complications

Liang Zhou; Xiaoyuan Zheng; Di Yang; Ying Wang; Xuesong Bai; Xinhua Ye

doi:10.1186/s12911-021-01525-7

BMC Med Inform Decis Mak. 2021 Jun 7;21(1):182. doi: 10.1186/s12911-021-01525-7.

Authors

Liang Zhou^#¹, Xiaoyuan Zheng^#¹, Di Yang², Ying Wang¹, Xuesong Bai³, Xinhua Ye⁴

Affiliations

¹ Department of Endocrinology, Changzhou No.2 People's Hospital Affiliated to Nanjing Medical University, 29 Xinglongxiang Road, Changzhou City, 213000, Jiangsu Province, China.
² Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
³ Capital Medical University, Beijing, 100053, China.
⁴ Department of Endocrinology, Changzhou No.2 People's Hospital Affiliated to Nanjing Medical University, 29 Xinglongxiang Road, Changzhou City, 213000, Jiangsu Province, China. czyxh2000@163.com.

^# Contributed equally.

Abstract

Background: Early diagnosis of diabetic complications is clinically demanding and of great significance. Given the complexity of diabetic complications, we applied a multi-label classification (MLC) model to predict four diabetic complications simultaneously using data from modern electronic health records (EHRs), and leveraged the correlations between the complications to further improve prediction accuracy.

Methods: We obtained the demographic characteristics and laboratory data from the EHRs of patients admitted to Changzhou No. 2 People's Hospital, the affiliated hospital of Nanjing Medical University in China, from May 2013 to June 2020. The data included 93 biochemical indicators and 9,765 patients. We used the Pearson correlation coefficient (PCC) to analyze the correlations between different diabetic complications from a statistical perspective. We used an MLC model, based on the Random Forest (RF) technique, to leverage these correlations and predict four complications simultaneously. We explored four different MLC models: Label Power Set (LP), Classifier Chains (CC), Ensemble Classifier Chains (ECC), and Calibrated Label Ranking (CLR). We used traditional Binary Relevance (BR) as a comparison. We used 11 different performance metrics and the area under the receiver operating characteristic curve (AUROC) to evaluate these models. We analyzed the weights of the learned model and illustrated (1) the top 10 key indicators of different complications and (2) the correlations between different diabetic complications.
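The workflow described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the sample size, feature count, tree count, and number of chains are arbitrary assumptions, and the ensemble shown here simply averages the probabilities of several randomly ordered chains, which is one common way to build an ECC. It uses scikit-learn's `MultiOutputClassifier` for Binary Relevance and `ClassifierChain` for the chains, with a Random Forest as the base learner.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for EHR data (assumption): 400 "patients",
# 10 "indicators", and 4 binary "complication" labels.
X = rng.normal(size=(400, 10))
latent = X[:, 0] + 0.5 * rng.normal(size=400)  # shared driver of labels 0-2
Y = np.column_stack([
    (latent > 0).astype(int),
    (latent + 0.5 * X[:, 1] > 0).astype(int),
    (latent + 0.5 * X[:, 2] > 0).astype(int),
    (X[:, 3] > 0).astype(int),  # independent of the shared driver
])

# Pearson correlation between the label columns (4 x 4 matrix).
label_corr = np.corrcoef(Y, rowvar=False)

X_train, X_test = X[:300], X[300:]
Y_train, Y_test = Y[:300], Y[300:]

# Binary Relevance: one independent Random Forest per label.
br = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
br.fit(X_train, Y_train)
br_pred = br.predict(X_test)

# Ensemble of Classifier Chains: each chain feeds earlier label predictions
# to later labels; chains differ in their (random) label order.
chains = [
    ClassifierChain(
        RandomForestClassifier(n_estimators=50, random_state=i),
        order="random",
        random_state=i,
    )
    for i in range(5)
]
for chain in chains:
    chain.fit(X_train, Y_train)

# Average the per-label probabilities across chains, then threshold at 0.5.
ecc_proba = np.mean([chain.predict_proba(X_test) for chain in chains], axis=0)
ecc_pred = (ecc_proba >= 0.5).astype(int)

print(np.round(label_corr, 2))
```

Because each chain passes its earlier label predictions on as extra inputs, it can exploit label correlations that Binary Relevance ignores; averaging several label orders reduces the dependence on any single ordering.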

Results: The MLC models including CC, ECC and CLR outperformed the traditional BR method on most performance metrics; the ECC models performed best on Hamming loss (0.1760), Accuracy (0.7020), F1_Score (0.7855), Precision (0.8649), F1_micro (0.8078), F1_macro (0.7773), Recall_micro (0.8631), Recall_macro (0.8009), and AUROC (0.8231). The two diabetic complication correlation matrices drawn from the PCC analysis and the MLC models were consistent with each other and indicated that the complications were correlated to different extents. The top 10 key indicators given by the model are valuable for medical applications.
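To make a few of the reported metrics concrete, the following sketch computes Hamming loss, micro-averaged F1, and macro-averaged F1 on a small hand-made example. The label matrices are invented for illustration and are unrelated to the study data, and the definitions are the standard ones, which the paper's exact formulas may or may not match in every detail.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label assignments that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / total


def label_counts(y_true, y_pred, j):
    """True positives, false positives, false negatives for label j."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        if row_t[j] == 1 and row_p[j] == 1:
            tp += 1
        elif row_t[j] == 0 and row_p[j] == 1:
            fp += 1
        elif row_t[j] == 1 and row_p[j] == 0:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_micro(y_true, y_pred):
    """Pool the counts over all labels, then compute one F1."""
    n_labels = len(y_true[0])
    counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    return f1_from_counts(tp, fp, fn)


def f1_macro(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    scores = [
        f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)
    ]
    return sum(scores) / n_labels


# Toy example: 4 patients, 4 hypothetical complication labels.
y_true = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
y_pred = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]

print(hamming_loss(y_true, y_pred))  # 2 wrong cells out of 16 -> 0.125
print(f1_micro(y_true, y_pred))
print(f1_macro(y_true, y_pred))
```

Lower is better for Hamming loss, while higher is better for the F1 variants; micro-averaging weights frequent labels more heavily, whereas macro-averaging treats every complication equally.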

Conclusions: Our MLC model can effectively utilize the potential correlation between different diabetic complications to further improve the prediction accuracy. This model should be explored further in other complex diseases with multiple complications.

Keywords: Correlation; Diabetic complication; Key indicators; Machine learning; Multi-label classification.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

China
Delivery of Health Care
Diabetes Complications* / diagnosis
Diabetes Mellitus* / diagnosis
Electronic Health Records
Humans
ROC Curve