RCorp: a resource for chemical disease semantic extraction in Chinese

BMC Med Inform Decis Mak. 2019 Dec 5;19(Suppl 5):234. doi: 10.1186/s12911-019-0936-3.

Abstract

Background: High-throughput screening is desirable for robustly identifying synergistic drug combinations, and machine-learning tools that automatically extract relations from published papers would be of great help. To support chemical-disease semantic relation extraction, especially for chronic diseases, we manually annotated RCorp, a chronic-disease-specific corpus in Chinese for combination therapy discovery.

Methods: We extracted abstracts from a Chinese medical literature server and followed the annotation framework of the BioCreative CDR corpus, modifying the guidelines to cover relations relevant to combination therapy. An annotation tool was incorporated into the standard annotation process.

Results: The resulting RCorp consists of 339 Chinese biomedical articles with 2367 annotated chemicals, 2113 diseases, 237 symptoms, 164 chemical-induce-disease relations, 163 chemical-induce-symptom relations, and 805 chemical-treat-disease relations. Each annotation includes both the mention text span and a normalized concept identifier. Inter-annotator agreement, measured by F score, is 0.883 for chemical entities and 0.791 for disease entities. The F score for chemical-treat-disease relations is 0.788 after unifying the entity mentions.
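
As an illustration of how an F score can serve as an inter-annotator agreement measure, the following is a minimal sketch, not the authors' evaluation script. It assumes exact-match comparison of entity spans and a hypothetical tuple representation `(doc_id, start, end, entity_type)`; the abstract does not state the matching criterion that was actually used, and the sample annotations are invented for the example.

```python
from typing import Set, Tuple

# Hypothetical representation of one entity annotation:
# (document id, start offset, end offset, entity type).
Annotation = Tuple[str, int, int, str]


def agreement_f_score(annotator_a: Set[Annotation],
                      annotator_b: Set[Annotation]) -> float:
    """F score between two annotation sets under exact matching.

    Annotator A is treated as the reference and annotator B as the
    prediction. Swapping them exchanges precision and recall, which
    leaves the F score unchanged.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing: full agreement
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Invented toy annotations, for illustration only.
first = {
    ("doc1", 0, 4, "Chemical"),
    ("doc1", 10, 15, "Disease"),
    ("doc2", 3, 8, "Chemical"),
}
second = {
    ("doc1", 0, 4, "Chemical"),
    ("doc1", 10, 16, "Disease"),   # boundary differs, so no exact match
    ("doc2", 3, 8, "Chemical"),
    ("doc2", 20, 25, "Disease"),   # marked by the second annotator only
}

score = agreement_f_score(first, second)
print(round(score, 3))
```

Under exact matching, a one-character boundary disagreement counts as a full miss for both annotators, so a relaxed (overlap-based) criterion would generally yield higher agreement on the same data.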

Conclusions: We extracted and manually annotated a chronic-disease-specific corpus in Chinese for combination therapy discovery. Analysis of the corpus supports its quality for the task of discovering combination therapy related knowledge. The annotated corpus should be a useful resource for developing entity recognition and relation extraction tools. In the future, we plan to hold an evaluation based on the corpus.

Keywords: Chemical-disease relations; Chronic diseases; Combination therapy; Corpus annotation; Relation extraction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Chronic Disease / therapy*
  • Combined Modality Therapy
  • Data Mining / methods*
  • Humans
  • Language
  • Semantics*