Risk prediction of chronic diseases with a two-stage semi-supervised clustering method

Prev Med Rep. 2023 Feb 6:32:102129. doi: 10.1016/j.pmedr.2023.102129. eCollection 2023 Apr.

Abstract

Early detection of chronic diseases such as cardiovascular disease (CVD) and diabetes can make the difference between life and death. Previous studies have demonstrated the feasibility of disease diagnosis and prediction using machine learning and disease-indicating biomarkers. The aim of this study is to develop a method to detect the risk of future disease even when disease-indicating biomarker readings are in the normal range. Data from the US Centers for Disease Control and Prevention (CDC) National Health and Nutrition Examination Surveys (NHANES) are used for this study. A two-stage semi-supervised K-Means (SSK-Means) clustering approach was developed to identify the underlying risk of each individual and categorize them into high or low-risk groups for CVD and diabetes. Our developed method of classification can identify groups as high risk or low risk, even if they would have been considered normal using traditional biomarker threshold criteria. For CVD, the SSK-Means clustering results showed that individuals over 30 years of age in the high-risk group were almost twice as likely to develop CVD as individuals in the low-risk group. For diabetes, the SSK-Means clustering results showed that individuals over 50 years in the high-risk group have at least two times the risk of developing diabetes compared with individuals in the low-risk group.