Data-driven Sublanguage Analysis for Cancer Genomics Knowledge Modeling: Applications in Mining Oncological Genetics Information from Patients' Genetic Reports

Yiqing Zhao; Hanzhong Yu; Sunyang Fu; Feichen Shen; Jaime I Davila; Hongfang Liu; Chen Wang

AMIA Jt Summits Transl Sci Proc. 2020 May 30:2020:720-729. eCollection 2020.

Authors

Yiqing Zhao¹, Hanzhong Yu¹, Sunyang Fu¹, Feichen Shen¹, Jaime I Davila¹, Hongfang Liu¹, Chen Wang¹

Affiliation

¹ Division of Digital Health Sciences, Mayo Clinic, Rochester, MN.

PMID: 32477695
PMCID: PMC7233104

Abstract

Clinical genetic testing reports contain abundant information, but that information is often poorly documented or underused in decision making. Unstructured information in genetic reports can contribute to long-term patient management and future translational research. We therefore proposed a knowledge model that manages unstructured information in medical genetic reports and facilitates knowledge extraction, curation, and updating. For this pilot study, we used a dataset of 1,565 cancer genetics reports from Mayo Clinic patients. We applied a previously developed, data-driven discovery pipeline that combines semantic annotation with co-occurrence association analysis to establish the knowledge model. We showed that around 56% of the testing results present in genetic reports are missing or incomplete in the clinical notes. We built a genetic report knowledge model and highlighted four key semantic groups, including "Genes and Gene Products" and "Treatments". Coverage of term annotation was 99.5%. Accuracies of term annotation and relationship extraction were 98.9% and 92.9%, respectively.
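As a rough illustration of what a co-occurrence association step can look like, the sketch below counts how often pairs of semantic groups appear together in the same report. It is not the authors' pipeline: the function name `cooccurrence_counts`, the input layout (one list of semantic group labels per report), and the toy data are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotated_reports):
    """Count, per unordered pair of semantic groups, the reports containing both.

    `annotated_reports` is a hypothetical input format: one iterable of
    semantic group labels per report, as a semantic annotator might emit.
    """
    counts = Counter()
    for groups in annotated_reports:
        # De-duplicate within a report and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(groups)), 2):
            counts[pair] += 1
    return counts


# Toy, made-up annotations (not data from the study).
reports = [
    ["Genes and Gene Products", "Treatments"],
    ["Genes and Gene Products", "Treatments", "Genes and Gene Products"],
    ["Genes and Gene Products"],
]
counts = cooccurrence_counts(reports)
```

Pairs with high counts are candidates for relationships in a knowledge model. A real analysis would typically normalize these counts, for example with an association measure, before interpreting them.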