The CLEF corpus: semantic annotation of clinical text

AMIA Annu Symp Proc. 2007 Oct 11;2007:625-9.

Authors

Angus Roberts¹, Robert Gaizauskas, Mark Hepple, Neil Davis, George Demetriou, Yikun Guo, Jay Kola, Ian Roberts, Andrea Setzer, Archana Tapuria, Bill Wheeldin

Affiliation

¹ Natural Language Processing Group, University of Sheffield, UK.

PMID: 18693911
PMCID: PMC2655900

Abstract

The Clinical E-Science Framework (CLEF) project is building a framework for the capture, integration and presentation of clinical information, for use in clinical research, evidence-based health care and genotype-meets-phenotype informatics. A significant portion of the information required by such a framework originates as text, even in EHR-savvy organizations. CLEF uses Information Extraction (IE) to make this unstructured information available. An important part of IE is the identification of semantic entities and relationships. Typical approaches require human-annotated documents to provide both evaluation standards and material for system development. CLEF has a corpus of clinical narratives, histopathology reports and imaging reports from 20,000 patients. We describe the selection of a subset of this corpus for manual annotation of clinical entities and relationships. We describe an annotation methodology and report encouraging initial results for inter-annotator agreement. Comparisons are made between different text sub-genres, and between annotators with different skills.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Humans
Information Storage and Retrieval / methods*
Medical Records Systems, Computerized*
Natural Language Processing*
Semantics