Predicting early-onset COPD risk in adults aged 20-50 using electronic health records and machine learning

PeerJ. 2024 Feb 23;12:e16950. doi: 10.7717/peerj.16950. eCollection 2024.

Abstract

Chronic obstructive pulmonary disease (COPD) is a major public health concern, affecting an estimated 164 million people worldwide. Early detection and intervention strategies are essential to reduce the burden of COPD, but current screening approaches are limited in their ability to accurately predict risk. Machine learning (ML) models offer promise for improved accuracy of COPD risk prediction by combining genetic and electronic medical record data. In this study, we developed and evaluated eight ML models for primary screening of COPD utilizing routine screening data, polygenic risk scores (PRS), additional clinical data, or a combination of all three. To assess our models, we conducted a retrospective analysis of approximately 329,396 patients in the UK Biobank database. Incorporating personal information and blood biochemical test results significantly improved the model's accuracy for predicting COPD risk, achieving a best performance of an AUC of 0.8505, a specificity of 0.8539 and a sensitivity of 0.7584. These results indicate that ML models can be effectively utilized for accurate prediction of COPD risk in individuals aged 20 to 50 years, providing a valuable tool for early detection and intervention.
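For readers unfamiliar with the three metrics reported above, the following is a minimal sketch of how AUC, sensitivity, and specificity are conventionally computed from binary labels and predicted risk scores. It is not the study's code: the function names, the 0.5 decision threshold, and the toy data are all assumptions made for illustration, and the abstract does not state which threshold or software the authors used.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at a given decision threshold.

    The threshold of 0.5 is an illustrative assumption, not the study's value.
    """
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1  # case correctly flagged
        elif label == 1:
            fn += 1  # case missed
        elif predicted == 0:
            tn += 1  # non-case correctly cleared
        else:
            fp += 1  # non-case wrongly flagged
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a random case outranks a random non-case.

    Ties count as half a win. This pairwise form is quadratic in the number
    of samples, so it is only suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = COPD case, 0 = non-case; scores are model risks.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two kinds of figure are reported separately.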

Keywords: COPD; Chronic obstructive pulmonary disease; Early-onset; Electronic health records; Genetic data; Machine learning; Polygenic risk scores; Risk prediction; UK Biobank.

MeSH terms

  • Adult
  • Databases, Factual
  • Electronic Health Records*
  • Humans
  • Machine Learning
  • Pulmonary Disease, Chronic Obstructive* / diagnosis
  • Retrospective Studies

Grants and funding

This work received funding and technical support from Ailurus Biotechnology Co., Ltd. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.