Structuring Legacy Pathology Reports by openEHR Archetypes to Enable Semantic Querying

Methods Inf Med. 2017 May 18;56(3):230-237. doi: 10.3414/ME16-01-0073. Epub 2017 Feb 28.

Abstract

Background: Clinical information is often stored as free text, e.g. in discharge summaries or pathology reports. These documents are semi-structured by section headers, numbered lists, items and classification strings. However, retrieving relevant documents remains challenging, because keyword searches over complete, unstructured documents return many false positives.

Objectives: We concentrate on the processing of pathology reports as an example of unstructured clinical documents. The objective is to transform reports semi-automatically into an information structure that enables improved access to and retrieval of relevant data. The data are to be stored in a standardized, structured form so that they are accessible to queries restricted to specific sections of a document (section-sensitive queries) and available for information reuse.

Methods: Our processing pipeline comprises information modelling, section boundary detection and section-sensitive queries. To enable focused search in unstructured data, documents are automatically structured and transformed into a patient information model specified through openEHR archetypes. The resulting XML-based pathology electronic health records (PEHRs) are queried with XQuery and rendered as HTML with XSLT.
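
The difference between a whole-document keyword search and a section-sensitive query can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (`report` and `section` elements with a `name` attribute), the sample texts and the function names are assumptions made for illustration. Real archetype-based openEHR XML is far more deeply nested, and the standard-library XPath subset used here merely stands in for XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record layout (assumption, not the
# PEHR schema used in the paper).
RECORDS = """
<records>
  <report id="r1">
    <section name="clinical_information">Suspected carcinoma of the colon.</section>
    <section name="diagnosis">Tubular adenoma, no evidence of malignancy.</section>
  </report>
  <report id="r2">
    <section name="clinical_information">Follow-up after polypectomy.</section>
    <section name="diagnosis">Adenocarcinoma of the colon, moderately differentiated.</section>
  </report>
</records>
"""


def full_text_hits(root, term):
    """Return ids of reports containing `term` anywhere in their text."""
    term = term.lower()
    return [
        report.get("id")
        for report in root.iter("report")
        if term in "".join(report.itertext()).lower()
    ]


def section_hits(root, section, term):
    """Return ids of reports containing `term` in the named section only."""
    term = term.lower()
    hits = []
    for report in root.iter("report"):
        # Restrict the search to sections with the requested name.
        for sec in report.findall(f"section[@name='{section}']"):
            if term in (sec.text or "").lower():
                hits.append(report.get("id"))
                break
    return hits


root = ET.fromstring(RECORDS)
print(full_text_hits(root, "carcinoma"))             # ['r1', 'r2']
print(section_hits(root, "diagnosis", "carcinoma"))  # ['r2']
```

In this toy data the whole-document search also returns report r1, where the term appears only as a clinical suspicion, while the query restricted to the diagnosis section returns r2 alone.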

Results: Pathology reports (PRs) can be reliably structured into sections by a keyword-based approach. Information modelling with openEHR saves time in the modelling process, since many archetypes can be reused. The resulting standardized, structured PEHRs give access to relevant data by retrieving the data that match user queries.
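
A keyword-based section boundary detector of the kind mentioned above might look like the following Python sketch. The abstract does not list the actual keywords or rules, so the header keywords, the requirement of a trailing colon, the `preamble` fallback and the sample report are all assumptions chosen for illustration.

```python
import re

# Hypothetical header keywords mapped to section names (assumption; the
# abstract does not state the keyword set that was used).
SECTION_KEYWORDS = {
    "clinical information": "clinical_information",
    "macroscopy": "macroscopy",
    "microscopy": "microscopy",
    "diagnosis": "diagnosis",
}

# A line starts a new section if it begins with a known keyword and a colon.
HEADER_RE = re.compile(
    r"^\s*(" + "|".join(re.escape(k) for k in SECTION_KEYWORDS) + r")\s*:\s*(.*)$",
    re.IGNORECASE,
)


def split_sections(report_text):
    """Split a free-text report into {section_name: text} by header keywords."""
    sections = {}
    current = "preamble"  # text that appears before the first recognised header
    for line in report_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            current = SECTION_KEYWORDS[match.group(1).lower()]
            sections.setdefault(current, [])
            rest = match.group(2).strip()
            if rest:  # text on the same line as the header
                sections[current].append(rest)
        elif line.strip():
            sections.setdefault(current, []).append(line.strip())
    return {name: " ".join(parts) for name, parts in sections.items()}


REPORT = """Clinical information: Polyp, sigmoid colon.
Macroscopy:
Polypoid tissue fragment, 0.8 cm.
Diagnosis: Tubular adenoma with low-grade dysplasia.
"""

sections = split_sections(REPORT)
for name, text in sections.items():
    print(f"{name}: {text}")
```

Requiring the colon is a deliberate simplification: it keeps body sentences that merely begin with a keyword from being mistaken for headers, at the cost of missing headers written without one.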

Conclusions: Mapping unstructured reports into a standardized information model is a practical way to improve access to data. Archetype-based XML enables section-sensitive retrieval and visualisation with well-established XML techniques. Restricting retrieval to particular sections has the potential to save retrieval time and improve retrieval accuracy.

Keywords: Standardized electronic health record; electronic health record system; information retrieval; medical informatics applications; openEHR; section boundary detection.

MeSH terms

  • Data Curation / standards*
  • Electronic Health Records / standards*
  • Guidelines as Topic
  • Health Information Interoperability / standards*
  • Information Storage and Retrieval / standards*
  • Machine Learning
  • Medical Record Linkage / methods
  • Medical Record Linkage / standards
  • Natural Language Processing*
  • Pathology / organization & administration*
  • Semantics*
  • Vocabulary, Controlled