Real-world data medical knowledge graph: construction and applications

Linfeng Li; Peng Wang; Jun Yan; Yao Wang; Simin Li; Jinpeng Jiang; Zhe Sun; Buzhou Tang; Tsung-Hui Chang; Shenghui Wang; Yuting Liu

doi:10.1016/j.artmed.2020.101817

Artif Intell Med. 2020 Mar:103:101817. doi: 10.1016/j.artmed.2020.101817. Epub 2020 Feb 6.

Authors

Linfeng Li¹, Peng Wang², Jun Yan³, Yao Wang³, Simin Li³, Jinpeng Jiang³, Zhe Sun³, Buzhou Tang⁴, Tsung-Hui Chang⁵, Shenghui Wang⁶, Yuting Liu⁷

Affiliations

¹ Institute of Information Science, Beijing Jiaotong University, Beijing, China; Yidu Cloud Technology Inc., Beijing, China.
² College of Computer Science, Chongqing University, Chongqing, China; Southwest Hospital, Chongqing, China.
³ Yidu Cloud Technology Inc., Beijing, China.
⁴ Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.
⁵ The School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China.
⁶ Institute of Information Science, Beijing Jiaotong University, Beijing, China.
⁷ School of Science, Beijing Jiaotong University, Beijing, China. Electronic address: ytliu@bjtu.edu.cn.

PMID: 32143785
DOI: 10.1016/j.artmed.2020.101817

Abstract

Objective: The medical knowledge graph (KG) is attracting attention from both academia and the healthcare industry because of its power in intelligent healthcare applications. In this paper, we introduce a systematic approach to building a medical KG from electronic medical records (EMRs), evaluated through both technical experiments and end-to-end application examples.

Materials and methods: The original data set contains 16,217,270 de-identified clinical visit records from 3,767,198 patients. The KG construction procedure comprises 8 steps: data preparation, entity recognition, entity normalization, relation extraction, property calculation, graph cleaning, related-entity ranking, and graph embedding. We propose a novel quadruplet structure to represent medical knowledge instead of the classical triplet used in KGs. We also propose a novel related-entity ranking function that considers probability, specificity and reliability (PSR). In addition, the probabilistic translation on hyperplanes (PrTransH) algorithm is used to learn graph embeddings for the generated KG.
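
As an illustration of the embedding step, the following is a minimal sketch of the standard translation on hyperplanes (TransH) distance, on which PrTransH appears to build judging by its name. The probabilistic part of PrTransH is not described in this abstract and is not modeled here. The function names and the toy vectors are assumptions made for illustration only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_to_hyperplane(v, unit_normal):
    """Project v onto the hyperplane with the given unit normal: v - (w.v) w."""
    scale = dot(unit_normal, v)
    return [vi - scale * wi for vi, wi in zip(v, unit_normal)]

def transh_distance(head, tail, normal, translation):
    """Standard TransH distance ||h_perp + d_r - t_perp||_2 (lower = more plausible).

    `normal` is the relation's hyperplane normal (normalized here),
    `translation` is the relation's translation vector on that hyperplane.
    """
    norm = math.sqrt(dot(normal, normal))
    w = [x / norm for x in normal]
    h_perp = project_to_hyperplane(head, w)
    t_perp = project_to_hyperplane(tail, w)
    diff = [h + d - t for h, d, t in zip(h_perp, translation, t_perp)]
    return math.sqrt(dot(diff, diff))

# Toy 3-dimensional example (hypothetical vectors, not learned embeddings).
normal = [0.0, 0.0, 2.0]
translation = [1.0, 0.0, 0.0]
head = [1.0, 0.0, 3.0]
good_tail = [2.0, 0.0, -1.0]
bad_tail = [2.0, 1.0, 5.0]
print(transh_distance(head, good_tail, normal, translation))
print(transh_distance(head, bad_tail, normal, translation))
```

Projecting onto a relation-specific hyperplane lets one entity take different effective positions under different relations, which is the usual motivation for TransH over a plain translation model.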

Results: A medical KG with 9 entity types, including disease and symptom, was established; it contains 22,508 entities and 579,094 quadruplets. Compared with the term frequency-inverse document frequency (TF/IDF) method, the proposed ranking function increased the normalized discounted cumulative gain (NDCG@10) from 0.799 to 0.906. Embedding representations for all entities and relations were learned and were shown to be effective through disease clustering.
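
For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@k using the common linear-gain, log2-discount definition. The abstract does not state which gain variant or relevance scale was used, so both are assumptions here, and the relevance grades in the example are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of ranked related entities (3 = best).
perfect_ranking = [3, 2, 1, 0]
reversed_ranking = [0, 1, 2, 3]
print(ndcg_at_k(perfect_ranking))
print(ndcg_at_k(reversed_ranking))
```

A value of 1 means the ranking matches the ideal ordering of the judged items; lower values indicate that relevant entities appear further down the list.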

Conclusion: The established systematic procedure can efficiently construct a high-quality medical KG from large-scale EMRs. The proposed ranking function PSR achieves the best performance under all relations, and the disease clustering result validates the efficacy of the learned embedding vectors as semantic representations of entities. Moreover, the obtained KG has found many successful applications owing to its statistics-based quadruplets.

In the reliability measure, N_co^min is a minimum co-occurrence number and R is the basic reliability value. The reliability value measures how reliable the relationship between S_i and O_ij is. The rationale for the definition is that the higher the value of N_co(S_i, O_ij), the more reliable the relationship. However, the reliability values of two relationships should not differ greatly when both of their co-occurrence numbers are very large. In our study, we set N_co^min = 10 and R = 1 after some experiments. For instance, if the co-occurrence numbers of three relationships are 1, 100 and 10000, their reliability values are 1, 2.96 and 5, respectively.
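
The reliability formula itself does not appear in this text. The sketch below is therefore an assumption: a base-10 logarithmic form chosen because it reproduces the three example values quoted above with N_co^min = 10 and R = 1. It should not be read as the authors' definition. The function and parameter names are hypothetical.

```python
import math

N_CO_MIN = 10   # minimum co-occurrence number (value from the text)
R_BASE = 1.0    # basic reliability value R (value from the text)

def reliability(n_co, n_co_min=N_CO_MIN, r_base=R_BASE):
    """Assumed form: R below the minimum, otherwise log10(n_co - n_co_min + 1) + R.

    The logarithm makes reliability grow with co-occurrence while keeping
    the gap small between two very large co-occurrence numbers.
    """
    if n_co < n_co_min:
        return r_base
    return math.log10(n_co - n_co_min + 1) + r_base

for n_co in (1, 100, 10000):
    print(n_co, round(reliability(n_co), 2))
```

Under this assumed form, the function is continuous at the threshold, since a co-occurrence number equal to N_co^min also yields R.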

Keywords: CDSS; PSR; medical knowledge graph; quadruplet; real-world data.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Databases, Factual*
Electronic Health Records / organization & administration*
Humans
Pattern Recognition, Automated / methods*
Reproducibility of Results
Semantics*