Predicting the Onset of Diabetes with Machine Learning Methods

J Pers Med. 2023 Feb 24;13(3):406. doi: 10.3390/jpm13030406.

Abstract

The number of people suffering from diabetes in Taiwan has continued to rise in recent years. According to the statistics of the International Diabetes Federation, about 537 million people worldwide (10.5% of the global population) suffer from diabetes, and it is estimated that 643 million people will develop the condition (11.3% of the total population) by 2030. If this trend continues, the number will jump to 783 million (12.2%) by 2045. At present, the number of people with diabetes in Taiwan has reached 2.18 million, with an average of one in ten people suffering from the disease. In addition, according to the Bureau of National Health Insurance in Taiwan, the prevalence rate of diabetes among adults in Taiwan has reached 5% and is increasing each year. Diabetes can cause acute and chronic complications that can be fatal. Meanwhile, chronic complications can result in a variety of disabilities or organ decline. If holistic treatments and preventions are not provided to diabetic patients, it will lead to the consumption of more medical resources and a rapid decline in the quality of life of society as a whole. In this study, based on the outpatient examination data of a Taipei Municipal medical center, 15,000 women aged between 20 and 80 were selected as the subjects. These women were patients who had gone to the medical center during 2018-2020 and 2021-2022 with or without the diagnosis of diabetes. This study investigated eight different characteristics of the subjects, including the number of pregnancies, plasma glucose level, diastolic blood pressure, sebum thickness, insulin level, body mass index, diabetes pedigree function, and age. After sorting out the complete data of the patients, this study used Microsoft Machine Learning Studio to train the models of various kinds of neural networks, and the prediction results were used to compare the predictive ability of the various parameters for diabetes. Finally, this study found that after comparing the models using two-class logistic regression as well as the two-class neural network, two-class decision jungle, or two-class boosted decision tree for prediction, the best model was the two-class boosted decision tree, as its area under the curve could reach a score of 0.991, which was better than other models.

Keywords: F1 score; area under the curve; artificial neural network; confusion matrix; deep learning; machine learning; recall; receiver operator characteristic; supervised learning.

Publication types

  • Review

Grants and funding

The APC was funded by the Research Center for Healthcare Industry Innovation, National Taipei University of Nursing and Health Sciences, Taipei 112, Taiwan.