Text mining of cancer-related information: review of current status and future directions

Irena Spasić; Jacqueline Livsey; John A Keane; Goran Nenadić

doi:10.1016/j.ijmedinf.2014.06.009

Int J Med Inform. 2014 Sep;83(9):605-23. doi: 10.1016/j.ijmedinf.2014.06.009. Epub 2014 Jun 24.

Authors

Irena Spasić¹, Jacqueline Livsey², John A Keane³, Goran Nenadić³

Affiliations

¹ School of Computer Science & Informatics, Cardiff University, Cardiff CF24 3AA, UK. Electronic address: i.spasic@cs.cardiff.ac.uk.
² Clinical Outcomes Unit, The Christie NHS Foundation Trust, Manchester M20 4BX, UK.
³ School of Computer Science, The University of Manchester, Manchester M13 9PL, UK; Health e-Research Centre, Manchester M13 9PL, UK; Manchester Institute of Biotechnology, Manchester M1 7DN, UK.

PMID: 25008281

Abstract

Purpose: This paper reviews the research literature on text mining (TM) with the aim of finding out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and to suggest future directions in TM development to support cancer research.

Methods: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar.

Results: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.
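
As a rough illustration of the symbolic methods and the evaluation measure named in the Results, the following Python sketch shows NER by dictionary lookup, IE by pattern matching, and the F-measure computed as the harmonic mean of precision and recall. The lexicon, the size pattern, the sample report and all function names are hypothetical and are not taken from any of the reviewed systems; a real system would rely on a curated terminology and far more robust patterns.

```python
import re

# Hypothetical toy lexicon (assumption); real systems use curated terminologies.
LEXICON = {
    "adenocarcinoma": "Diagnosis",
    "squamous cell carcinoma": "Diagnosis",
    "lymph node": "Anatomy",
}


def dictionary_ner(text):
    """Return (start, end, surface form, label) for each lexicon term found."""
    found = []
    for term, label in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((m.start(), m.end(), m.group(0), label))
    return sorted(found)


# Hypothetical pattern (assumption) for one simple IE task: a size with a unit.
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)


def extract_sizes(text):
    """Return (value, unit) pairs matched by the size pattern."""
    return [(float(value), unit.lower())
            for value, unit in SIZE_PATTERN.findall(text)]


def f_measure(predicted, gold):
    """Balanced F-measure: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented sample sentence, not drawn from any real clinical report.
report = "Invasive adenocarcinoma, 2.5 cm, with one lymph node involved."
entities = dictionary_ner(report)
sizes = extract_sizes(report)
```

Exact matching of this kind fails on the non-standard abbreviations and spelling errors that the review identifies as typical of clinical narratives, which is one reason the authors call for a shift towards machine learning.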

Keywords: Cancer; Data mining; Electronic medical records; Natural language processing.

Publication types

Research Support, Non-U.S. Gov't
Review

MeSH terms

Computational Biology / methods*
Data Mining / trends*
Humans
Information Storage and Retrieval
Medical Oncology*
Neoplasms*

Grants and funding

MC_PC_13042/MRC_/Medical Research Council/United Kingdom