Performance analysis and prediction of type 2 diabetes mellitus based on lifestyle data using machine learning approaches

J Diabetes Metab Disord. 2022 Mar 14;21(1):339-352. doi: 10.1007/s40200-022-00981-w. eCollection 2022 Jun.

Abstract

Objective: Diabetes is a chronic fatal disease that has affected millions of people all over the globe. Type 2 Diabetes Mellitus (T2DM) accounts for 90% of the affected population among all types of diabetes. Millions of T2DM patients remain undiagnosed due to lack of awareness and under resourced healthcare system. So, there is a dire need for a diagnostic and prognostic tool that shall help the healthcare providers, clinicians and practitioners with early prediction and hence can recommend the lifestyle changes required to stop the progression of diabetes. The main objective of this research is to develop a framework based on machine learning techniques using only lifestyle indicators for prediction of T2DM disease. Moreover, prediction model can be used without visiting clinical labs and hospital readmissions.

Method: A proposed framework is presented and implemented based on machine learning paradigms using lifestyle indicators for better prediction of T2DM disease. The current research has involved different experts like Diabetologists, Endocrinologists, Dieticians, Nutritionists, etc. for selecting the contributing 1552 instances and 11 attributes lifestyle biological features to promote health and manage complications towards T2DM disease. The dataset has been collected through survey and google forms from different geographical regions.

Results: Seven machine learning classifiers were employed namely K-Nearest Neighbour (KNN), Linear Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), Random Forest (RF) and Gradient Boosting (GB). Gradient Boosting classifier outperformed best with an accuracy rate of 97.24% for training and 96.90% for testing separately followed by RF, DT, NB, SVM, LR, and KNN as 95.36%, 92.52%, 90.72%, 90.20%, 90.20% and 77.06% respectively. However, in terms of precision, RF achieved high performance (0.980%) and KNN performed the lowest (0.793%). As far as recall is being concerned, GB achieved the highest rate of 0.975% and KNN showed the worst rate of 0.774%. Also, GB is top performed in terms of f1-score. According to the ROCs, GB and NB had a better area under the curve compared to the others.

Conclusion: The research developed a realistic health management system for T2DM disease based on machine learning techniques using only lifestyle data for prediction of T2DM. To extend the current study, these models shall be used for different, large and real-time datasets which share the commonality of data with T2DM disease to establish the efficacy of the proposed system.

Keywords: Diabetes prediction; Type 2 diabetes mellitus; ensemble learning model; jupyter notebook; machine learning algorithms.