ICD2Vec: Mathematical representation of diseases

J Biomed Inform. 2023 May:141:104361. doi: 10.1016/j.jbi.2023.104361. Epub 2023 Apr 11.

Abstract

Background: The International Classification of Diseases (ICD) codes represent the global standard for reporting disease conditions. The current ICD codes connote direct human-defined relationships among diseases in a hierarchical tree structure. Representing the ICD codes as mathematical vectors helps to capture nonlinear relationships in medical ontologies across diseases.

Methods: We propose a universally applicable framework called "ICD2Vec" designed to provide mathematical representations of diseases by encoding the corresponding information. First, we present the arithmetical and semantic relationships between diseases by mapping composite vectors for symptoms or diseases to the most similar ICD codes. Second, we investigate the validity of ICD2Vec by comparing the biological relationships and cosine similarities among the vectorized ICD codes. Third, we propose a new risk score called IRIS, derived from ICD2Vec, and demonstrate its clinical utility in large cohorts from the UK and South Korea.
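The first step above, mapping a composite vector to its most similar ICD codes, can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are invented toy values, the helper names (composite_vector, most_similar_codes) are hypothetical, and forming the composite by element-wise summation is an assumption, since the abstract does not state how composites are built.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for a zero vector
    return dot / (norm_u * norm_v)


def composite_vector(vectors):
    """Element-wise sum of several vectors (assumed composition rule)."""
    return [sum(components) for components in zip(*vectors)]


def most_similar_codes(query, code_vectors, top_k=3):
    """Rank ICD codes by cosine similarity to the query vector."""
    scored = [
        (code, cosine_similarity(query, vec))
        for code, vec in code_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy values only: these are NOT real ICD2Vec embeddings.
toy_code_vectors = {
    "J00": [0.9, 0.1, 0.0],
    "A99": [0.1, 0.9, 0.1],
    "B03": [0.0, 0.2, 0.9],
}
toy_symptom_vectors = {
    "cough": [1.0, 0.0, 0.0],
    "runny nose": [0.8, 0.2, 0.0],
}

query = composite_vector(list(toy_symptom_vectors.values()))
ranked = most_similar_codes(query, toy_code_vectors, top_k=2)
for code, score in ranked:
    print(code, round(score, 3))
```

With real embeddings, the same ranking would be run over every vectorized ICD code rather than three toy entries.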

Results: Semantic compositionality was qualitatively confirmed between descriptions of symptoms and ICD2Vec. For example, the diseases most similar to COVID-19 were found to be the common cold (ICD-10: J00), unspecified viral hemorrhagic fever (ICD-10: A99), and smallpox (ICD-10: B03). We show significant associations between the cosine similarities derived from ICD2Vec and the biological relationships across disease-to-disease pairs. Furthermore, we observed significant adjusted hazard ratios (HR) and areas under the receiver operating characteristic curve (AUROC) between IRIS and risks for eight diseases. For instance, a higher IRIS for coronary artery disease (CAD) was associated with a higher probability of incident CAD (HR: 2.15 [95% CI 2.02-2.28] and AUROC: 0.587 [95% CI 0.583-0.591]). We identified individuals at substantially increased risk of CAD using IRIS and 10-year atherosclerotic cardiovascular disease risk (adjusted HR: 4.26 [95% CI 3.59-5.05]).

Conclusions: ICD2Vec, a proposed universal framework for converting qualitatively measured ICD codes into quantitative vectors containing semantic relationships between diseases, exhibited a significant correlation with actual biological significance. In addition, the IRIS was a significant predictor of major diseases in a prospective study using two large-scale datasets. Based on this clinical validity and utility evidence, we suggest that publicly available ICD2Vec can be used in diverse research and clinical practices and has important clinical implications.

Keywords: Embedding; International classification of diseases; Mathematical representation; Risk score.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • COVID-19*
  • Coronary Artery Disease*
  • Humans
  • International Classification of Diseases
  • Prospective Studies
  • ROC Curve
  • Risk Factors