A categorical analysis of coreference resolution errors in biomedical texts

J Biomed Inform. 2016 Apr:60:309-18. doi: 10.1016/j.jbi.2016.02.015. Epub 2016 Feb 27.

Abstract

Background: Coreference resolution is an essential task in information extraction from the published biomedical literature. It supports the discovery of complex information by linking referring expressions such as pronouns and appositives to their referents, which are typically entities that play a central role in biomedical events. Correctly establishing these links allows a detailed understanding of all the participants in events, and makes it possible to connect events through their shared participants.

Results: As an initial step towards the development of a novel coreference resolution system for the biomedical domain, we have categorised the characteristics of coreference relations by type of anaphor as well as by broader syntactic and semantic characteristics. Using this categorisation, we have compared the performance of a domain adaptation of a state-of-the-art general system with published results from domain-specific systems. We have also developed a rule-based system for anaphoric coreference resolution in the biomedical domain with simple modules derived from available systems. Our results show that the domain-specific systems outperform the general system overall. Whilst this result is unsurprising, our proposed categorisation enables a detailed quantitative analysis of system performance. We identify limitations of each system and find that important gaps remain in the state-of-the-art systems, which are clearly identifiable with respect to the categorisation.

Conclusion: We have analysed in detail the performance of existing coreference resolution systems for the biomedical literature and have demonstrated that there are clear gaps in their coverage. The approach developed in the general domain needs to be tailored for portability to the biomedical domain. The framework we propose for class-based error analysis of existing systems helps to identify specific limitations of those systems. This in turn provides insights for further system development.
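
To illustrate what a class-based error analysis of this kind can look like in practice, the following is a minimal sketch, not the authors' implementation. It assumes each coreference link is an (anaphor, antecedent, category) triple and scores a system separately for each category. The function name, the category labels and the toy data are all hypothetical; the abstract does not list the paper's actual categories.

```python
from collections import defaultdict


def score_by_category(gold, predicted):
    """Compute precision, recall and F1 per anaphor category.

    Both arguments are iterables of (anaphor, antecedent, category) triples.
    This triple representation is an assumption made for illustration only.
    """
    gold_by_cat = defaultdict(set)
    pred_by_cat = defaultdict(set)
    for anaphor, antecedent, category in gold:
        gold_by_cat[category].add((anaphor, antecedent))
    for anaphor, antecedent, category in predicted:
        pred_by_cat[category].add((anaphor, antecedent))

    scores = {}
    for category in sorted(set(gold_by_cat) | set(pred_by_cat)):
        g = gold_by_cat[category]
        p = pred_by_cat[category]
        tp = len(g & p)  # links the system resolved correctly
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[category] = {
            "tp": tp,
            "fp": len(p) - tp,  # spurious or wrongly resolved links
            "fn": len(g) - tp,  # gold links the system missed
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return scores


# Hypothetical toy data: identifiers and categories are illustrative only.
gold = [
    ("it_1", "IL-2", "personal pronoun"),
    ("they_1", "T cells", "personal pronoun"),
    ("which_1", "NF-kappaB", "relative pronoun"),
    ("this protein_1", "p53", "definite noun phrase"),
]
predicted = [
    ("it_1", "IL-2", "personal pronoun"),
    ("they_1", "T cells", "personal pronoun"),
    ("which_1", "AP-1", "relative pronoun"),
]

scores = score_by_category(gold, predicted)
for category, s in scores.items():
    print(f"{category}: P={s['precision']:.2f} R={s['recall']:.2f} F1={s['f1']:.2f}")
```

Breaking scores down this way can expose gaps that an aggregate figure hides: in the toy data the hypothetical system handles personal pronouns but produces no links at all for definite noun phrases.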

Keywords: Coreference resolution; Error analysis; Natural language processing; Text mining.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Data Mining / methods*
  • Electronic Health Records*
  • False Negative Reactions
  • Humans
  • Language*
  • Medical Informatics
  • Natural Language Processing*
  • Pattern Recognition, Automated
  • Problem Solving
  • Publications
  • Reproducibility of Results
  • Semantics