Claims-based algorithms for common chronic conditions were efficiently constructed using machine learning methods

Konan Hara; Yasuki Kobayashi; Jun Tomio; Yuki Ito; Thomas Svensson; Ryo Ikesu; Ung-Il Chung; Akiko Kishi Svensson

doi:10.1371/journal.pone.0254394

PLoS One. 2021 Sep 27;16(9):e0254394. doi: 10.1371/journal.pone.0254394. eCollection 2021.

Authors

Konan Hara¹, Yasuki Kobayashi¹, Jun Tomio¹, Yuki Ito², Thomas Svensson³,⁴,⁵, Ryo Ikesu¹,³, Ung-Il Chung³,⁵,⁶, Akiko Kishi Svensson³,⁴,⁷

Affiliations

¹ Department of Public Health, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.
² Department of Economics, University of California, Berkeley, Berkeley, California, United States of America.
³ Precision Health, Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.
⁴ Department of Clinical Sciences, Lund University, Skåne University Hospital, Malmö, Sweden.
⁵ School of Health Innovation, Kanagawa University of Human Services, Kawasaki-shi, Kanagawa, Japan.
⁶ Clinical Biotechnology, Center for Disease Biology and Integrative Medicine, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.
⁷ Department of Diabetes and Metabolic Diseases, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.

Abstract

Identification of medical conditions using claims data is generally conducted with algorithms based on subject-matter knowledge. However, these claims-based algorithms (CBAs) depend heavily on the researchers' level of knowledge and are not necessarily optimized for target conditions. We investigated whether machine learning methods can supplement researchers' knowledge of target conditions in building CBAs. We conducted a retrospective cohort study using a claims database combined with annual health check-up results of employees' health insurance programs for fiscal year 2016-17 in Japan (study population for hypertension, N = 631,289; diabetes, N = 152,368; dyslipidemia, N = 614,434). We constructed CBAs with logistic regression, k-nearest neighbor, support vector machine, penalized logistic regression, tree-based model, and neural network for identifying patients with three common chronic conditions: hypertension, diabetes, and dyslipidemia. We then compared their association measures using a completely held-out test set (25% of the study population). Among the test cohorts of 157,822, 38,092, and 153,608 enrollees for hypertension, diabetes, and dyslipidemia, 25.4%, 8.4%, and 38.7%, respectively, had a diagnosis of the corresponding condition. The areas under the receiver operating characteristic curve (AUCs) of the logistic regression with/without subject-matter knowledge about the target condition were .923/.921 for hypertension, .957/.938 for diabetes, and .739/.747 for dyslipidemia. The logistic lasso, logistic elastic-net, and tree-based methods yielded AUCs comparable to those of the logistic regression with subject-matter knowledge: .923-.931 for hypertension; .958-.966 for diabetes; .747-.773 for dyslipidemia. We found that machine learning methods can attain AUCs comparable to the conventional knowledge-based method in building CBAs.
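The evaluation design described above (set aside 25% of the population, then compare AUCs on that test set) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `holdout_split` and `auc`, the synthetic enrollees, and the two hypothetical scores `score_a` and `score_b` are all assumptions made for illustration, and the AUC is computed from its pairwise-ranking definition rather than with any particular library.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and set aside test_fraction of them as a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly.

    A tie between a case and a non-case counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic enrollees (hypothetical, NOT the study data): about 25% have the
# condition; score_b is a less noisy predictor of the label than score_a.
rng = random.Random(42)
enrollees = []
for _ in range(2000):
    label = 1 if rng.random() < 0.25 else 0
    score_a = label + rng.gauss(0, 1.0)
    score_b = label + rng.gauss(0, 0.5)
    enrollees.append((label, score_a, score_b))

train, test = holdout_split(enrollees, test_fraction=0.25, seed=0)
# In a real analysis the models would be fitted on `train` only;
# here the scores are precomputed, so only the test set is evaluated.
test_labels = [row[0] for row in test]
auc_a = auc(test_labels, [row[1] for row in test])
auc_b = auc(test_labels, [row[2] for row in test])
print(f"test size = {len(test)}, AUC(a) = {auc_a:.3f}, AUC(b) = {auc_b:.3f}")
```

Truncating 25% of each study population to a whole number, as `holdout_split` does, gives 157,822, 38,092, and 153,608, which is consistent with the test cohort sizes reported above; whether the authors rounded this way is an assumption.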

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Chronic Disease
Databases, Factual*
Diabetes Mellitus / diagnosis*
Dyslipidemias / diagnosis*
Female
Humans
Hypertension / diagnosis*
Insurance Claim Review / statistics & numerical data*
Machine Learning*
Male
Middle Aged
Neural Networks, Computer
Retrospective Studies
Support Vector Machine

Grants and funding

This research is supported by the Center of Innovation Program from Japan Science and Technology Agency, JST. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.