Deep neural models for extracting entities and relationships in the new RDD corpus relating disabilities and rare diseases

Comput Methods Programs Biomed. 2018 Oct:164:121-129. doi: 10.1016/j.cmpb.2018.07.007. Epub 2018 Jul 20.

Abstract

Background and objective: There are a large number of rare diseases, many of which are associated with significant disabilities. Knowing in advance how a disease will evolve is essential to limit and prevent the appearance of disabilities and to prepare the patient to manage future difficulties. Rare disease associations are working to collect this information manually, but the process is slow. Much of the information about the consequences of rare diseases is published in scientific papers, and our goal is to automatically extract from them the disabilities associated with diseases.

Methods: This work presents a new corpus of abstracts from scientific papers related to rare diseases, which has been manually annotated with disabilities. The corpus makes it possible to train machine learning and deep learning systems that can automatically process other papers and thus extract new information about the relations between rare diseases and disabilities. The corpus is also annotated with negation and speculation where they affect disabilities. The corpus has been made publicly accessible.
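As an illustration of the kind of information such an annotation carries, the following is a minimal Python sketch of one way an annotated abstract could be represented in memory. It is not the corpus's actual file format: the class names, field names, label strings, and the example sentence are all hypothetical and were invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    entity_id: str
    label: str              # hypothetical labels: "Disability" or "RareDisease"
    start: int              # character offset where the mention begins
    end: int                # character offset just past the mention
    text: str
    negated: bool = False     # mention is under the scope of a negation
    speculated: bool = False  # mention is under the scope of a speculation


@dataclass
class Relation:
    disability_id: str
    disease_id: str


@dataclass
class AnnotatedAbstract:
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def asserted_disabilities(self) -> List[Entity]:
        """Disabilities that are neither negated nor speculated."""
        return [
            e for e in self.entities
            if e.label == "Disability" and not e.negated and not e.speculated
        ]


def make_entity(text: str, surface: str, entity_id: str, label: str,
                negated: bool = False, speculated: bool = False) -> Entity:
    """Build an Entity by locating the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return Entity(entity_id, label, start, start + len(surface), surface,
                  negated, speculated)


# Invented example sentence, for illustration only.
sentence = "Disease X causes deafness, may cause ataxia, and does not cause blindness."
example = AnnotatedAbstract(
    text=sentence,
    entities=[
        make_entity(sentence, "Disease X", "T1", "RareDisease"),
        make_entity(sentence, "deafness", "T2", "Disability"),
        make_entity(sentence, "ataxia", "T3", "Disability", speculated=True),
        make_entity(sentence, "blindness", "T4", "Disability", negated=True),
    ],
    relations=[Relation("T2", "T1"), Relation("T3", "T1"), Relation("T4", "T1")],
)

asserted = [e.text for e in example.asserted_disabilities()]
```

Keeping negation and speculation as flags on the disability mention, rather than discarding such mentions, lets a downstream system decide whether a stated relation should count as asserted.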

Results: We devised experiments using deep learning techniques to show the usefulness of the corpus. Specifically, we designed a long short-term memory (LSTM) based architecture for disability identification and a convolutional neural network for detecting relationships between disabilities and diseases. The systems require no preprocessing of the data; their only input is low-dimensional vectors representing the words.

Conclusions: The corpus makes it possible to train systems that identify disabilities in biomedical documents, which current annotation systems are not able to detect. Such systems could also be trained to detect relationships between disabilities and diseases, as well as negation and speculation, which can change the meaning of the text. The deep learning models designed for identifying disabilities and their relationships to diseases in new documents show that the corpus supports an F-measure of around 81% for disability recognition and 75% for relation extraction.
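For readers unfamiliar with the metric, the F-measure is commonly computed as the harmonic mean of precision and recall. The Python sketch below assumes the balanced F1 variant and exact-match comparison of annotated spans; the abstract does not state which variant or matching criterion was used, and the function name and toy spans are invented for illustration.

```python
from typing import Hashable, Set, Tuple


def f_measure(gold: Set[Hashable], predicted: Set[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match evaluation."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans as (start_token, end_token, label); not taken from the corpus.
gold_spans = {(0, 2, "Disability"), (5, 6, "Disability"), (9, 11, "Disability")}
predicted_spans = {(0, 2, "Disability"), (5, 6, "Disability"), (12, 13, "Disability")}

precision, recall, f1 = f_measure(gold_spans, predicted_spans)
```

In this toy case two of the three predicted spans match the gold annotation, so precision, recall, and F1 are each two thirds.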

Keywords: Biomedical corpora; Deep neural networks; Disabilities; Entity recognition; Rare diseases; Relationship classification.

MeSH terms

  • Data Mining
  • Databases, Factual / statistics & numerical data
  • Deep Learning
  • Disabled Persons / statistics & numerical data*
  • Humans
  • Natural Language Processing
  • Neural Networks, Computer*
  • Rare Diseases / etiology*
  • Semantics