Comparison between linear regression and four different machine learning methods in selecting risk factors for osteoporosis in a Chinese female aged cohort

J Chin Med Assoc. 2023 Nov 1;86(11):1028-1036. doi: 10.1097/JCMA.0000000000000999. Epub 2023 Sep 19.

Abstract

Background: Population aging is an increasingly acute challenge for countries around the world. One particular manifestation of this phenomenon is the impact of osteoporosis on individuals and national health systems. Previous studies of risk factors for osteoporosis were conducted using traditional statistical methods, but more recent efforts have turned to machine learning approaches. Most such efforts, however, treat the target variable (bone mineral density [BMD] or fracture rate) as categorical, which provides no quantitative information. The present study uses five different machine learning methods to analyze the risk factors for the T-score of BMD, seeking to (1) compare the prediction accuracy of the machine learning methods with that of traditional multiple linear regression (MLR) and (2) rank the importance of 25 different risk factors.

Methods: The study sample includes 24 412 women older than 55 years, each with 25 related variables. Traditional MLR and five different machine learning methods were applied: classification and regression tree, Naïve Bayes, random forest, stochastic gradient boosting, and eXtreme gradient boosting. Model performance was compared using four error metrics: symmetric mean absolute percentage error, relative absolute error, root relative squared error, and root mean squared error.
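
The four error metrics named above can be sketched in a few lines of Python. This is a minimal illustration using common textbook definitions; the abstract does not give the formulas, so the exact variants used in the study (for example, the SMAPE denominator) are an assumption, and the function name `error_metrics` and the sample values are hypothetical.

```python
import math

def error_metrics(y_true, y_pred):
    """Return SMAPE, RAE, RRSE, and RMSE for two equal-length sequences.

    Assumed (common) definitions, not taken from the paper:
      SMAPE = 100/n * sum(|y - p| / ((|y| + |p|) / 2))
      RAE   = sum|y - p| / sum|y - mean(y)|
      RRSE  = sqrt(sum (y - p)^2 / sum (y - mean(y))^2)
      RMSE  = sqrt(mean (y - p)^2)
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

    # Skip pairs where both values are zero to avoid dividing by zero.
    smape = 100.0 / n * sum(
        2.0 * a / (abs(t) + abs(p))
        for a, t, p in zip(abs_err, y_true, y_pred)
        if abs(t) + abs(p) > 0
    )
    # RAE and RRSE compare the model against always predicting the mean.
    rae = sum(abs_err) / sum(abs(t - mean_true) for t in y_true)
    rrse = math.sqrt(sum(sq_err) / sum((t - mean_true) ** 2 for t in y_true))
    rmse = math.sqrt(sum(sq_err) / n)
    return {"smape": smape, "rae": rae, "rrse": rrse, "rmse": rmse}

# Hypothetical T-score-like values, for illustration only.
observed = [-2.0, -1.0, 0.0, 1.0]
predicted = [-1.5, -1.0, 0.5, 0.5]
metrics = error_metrics(observed, predicted)
```

Under these definitions, RAE and RRSE values below 1 indicate a model that does better than predicting the sample mean for every subject, and lower is better for all four metrics.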

Results: Machine learning approaches outperformed MLR on all four error metrics. The average importance ranking of each factor generated by the machine learning methods indicates that age is the most important factor determining T-score, followed by estimated glomerular filtration rate (eGFR), body mass index (BMI), uric acid (UA), and education level.
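
One plausible way to obtain an average importance ranking across several models is to rank the factors within each model and then average those ranks. The sketch below illustrates that idea only; the abstract does not describe the exact aggregation procedure, and the function name `average_rank`, the model labels, the factor labels, and all scores are hypothetical placeholders rather than study results.

```python
def average_rank(importances):
    """Average each factor's within-model rank across models.

    importances: {model_name: {factor_name: importance_score}}
    Rank 1 is the most important factor within a model. Ties within a
    model are not treated specially here; they fall back to alphabetical
    order of the factor names.
    Returns a list of (factor, mean_rank) sorted from most to least important.
    """
    factors = sorted(next(iter(importances.values())))
    totals = {f: 0.0 for f in factors}
    for scores in importances.values():
        # Higher score means more important, so sort in descending order.
        ordered = sorted(factors, key=lambda f: scores[f], reverse=True)
        for rank, factor in enumerate(ordered, start=1):
            totals[factor] += rank
    n_models = len(importances)
    means = [(f, totals[f] / n_models) for f in factors]
    return sorted(means, key=lambda item: item[1])

# Placeholder scores for three hypothetical models and three factors.
example = {
    "model_a": {"x1": 0.5, "x2": 0.3, "x3": 0.2},
    "model_b": {"x1": 0.4, "x2": 0.1, "x3": 0.5},
    "model_c": {"x1": 0.6, "x2": 0.1, "x3": 0.3},
}
ranking = average_rank(example)
```

Averaging ranks rather than raw importance scores keeps models with differently scaled importance measures from dominating the result.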

Conclusion: In a group of women older than 55 years, machine learning methods estimated T-score more accurately than MLR, with age being the most important factor, followed by eGFR, BMI, UA, and education level.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bayes Theorem
  • East Asian People* / statistics & numerical data
  • Female
  • Humans
  • Linear Models*
  • Machine Learning*
  • Middle Aged
  • Osteoporosis* / epidemiology
  • Risk Assessment* / methods
  • Risk Factors
  • Taiwan / epidemiology