Automatic extraction of cancer characteristics from free-text pathology reports for cancer notifications

Stud Health Technol Inform. 2011;168:117-24.

Abstract

Objective: To develop a system for the automatic classification of Cancer Registry notification data from free-text pathology reports.

Method: Cancer notification items are extracted with a symbolic rule-based classification methodology, in which formal semantics are used to reason with the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) concepts identified in the free text. Business rules for cancer notifications used by Cancer Registry coding staff were also incorporated, with the aim of mimicking Cancer Registry processes.
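The general idea of rule-based classification over identified concepts can be illustrated with a minimal sketch. It is not the system described in this abstract: the concept labels, the toy is-a hierarchy, the phrase lexicon, the negation cues, and the single business rule ("a report is notifiable if it mentions a non-negated concept subsumed by a malignant neoplasm") are all hypothetical stand-ins, and no real SNOMED CT content or identifiers are used.

```python
import re

# Hypothetical toy hierarchy (child -> parent); NOT real SNOMED CT content.
IS_A = {
    "adenocarcinoma": "malignant_neoplasm",
    "squamous_cell_carcinoma": "malignant_neoplasm",
    "malignant_neoplasm": "neoplasm",
    "lipoma": "benign_neoplasm",
    "benign_neoplasm": "neoplasm",
}

# Hypothetical lexicon mapping surface phrases to concept labels.
LEXICON = {
    "adenocarcinoma": "adenocarcinoma",
    "squamous cell carcinoma": "squamous_cell_carcinoma",
    "lipoma": "lipoma",
}

# Assumed negation cues; a real system would need far more careful handling.
NEGATION_CUES = ("no evidence of", "negative for", "without")


def subsumed_by(concept, ancestor):
    """Walk up the is-a chain and report whether `ancestor` is reached."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def find_concepts(report):
    """Return (concept, negated) pairs for lexicon phrases found per sentence."""
    found = []
    for sentence in re.split(r"[.\n]+", report.lower()):
        for phrase, concept in LEXICON.items():
            idx = sentence.find(phrase)
            if idx == -1:
                continue
            # Treat the mention as negated if a cue precedes it in the sentence.
            negated = any(cue in sentence[:idx] for cue in NEGATION_CUES)
            found.append((concept, negated))
    return found


def is_notifiable(report):
    """Hypothetical business rule: any asserted malignant concept is notifiable."""
    return any(
        not negated and subsumed_by(concept, "malignant_neoplasm")
        for concept, negated in find_concepts(report)
    )
```

In this sketch the subsumption check is what lets one rule cover every concept placed under `malignant_neoplasm`, rather than listing each diagnosis in the rule itself.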

Results: The system was developed on a corpus of 239 histology and cytology reports (60% of them notifiable) and then evaluated on an independent set of 300 reports (20% of them notifiable). Results show that the system can reliably classify notifiable reports with 96% and 100% specificity, and achieve an overall accuracy of 82% and 74% for classifying notification items from notifiable reports at a unit record level on the development and evaluation sets, respectively.
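For readers unfamiliar with the measures reported above, the following sketch shows how sensitivity, specificity, and accuracy are conventionally computed for a binary notifiable/non-notifiable decision. The function names and the toy labels are hypothetical and are not data from the study.

```python
def confusion_counts(gold, predicted):
    """Count true/false positives and negatives for boolean label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    return tp, tn, fp, fn


def binary_metrics(gold, predicted):
    """Return sensitivity, specificity, and accuracy as fractions in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(gold, predicted)
    return {
        # Share of truly notifiable reports that were flagged as notifiable.
        "sensitivity": tp / (tp + fn),
        # Share of non-notifiable reports that were correctly left unflagged.
        "specificity": tn / (tn + fp),
        # Share of all reports classified correctly.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Toy labels for illustration only (True = notifiable).
toy_gold = [True, True, False, False, False]
toy_pred = [True, False, False, False, True]
toy_metrics = binary_metrics(toy_gold, toy_pred)
```

The sketch assumes both classes are present in the gold labels; otherwise one of the denominators is zero.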

Conclusion: Cancer Registries collect a multitude of data that requires manual review, slowing down the flow of information. Extracting and providing an automatically coded cancer pathology notification for review can lessen the reliance on expert clinical staff, improving the efficiency and availability of cancer information.

MeSH terms

  • Data Mining / methods*
  • Disease Notification*
  • Humans
  • Neoplasms / pathology*
  • Registries
  • Systematized Nomenclature of Medicine