Selection of diagnosis with oncologic relevance information from histopathology free text reports: A machine learning approach

Int J Med Inform. 2022 Apr:160:104714. doi: 10.1016/j.ijmedinf.2022.104714. Epub 2022 Feb 7.

Abstract

Histopathology reports are a primary data source for the case definition phase of a Cancer Registry. By reading the histopathology report, the operator who evaluates an oncology case can define the morphology and topography of the cancer and validate the case with the highest basis of diagnosis. The key problem for the Catania-Messina-Enna Integrated Cancer Registry (RTI) is that these reports are written in natural language, and those containing information relevant to cancer evaluation are only a small fraction of the histopathology reports produced each year. In this population-based retrospective cohort study, we aim to reduce the working time RTI operators spend seeking and selecting the right information among the histopathology reports of the eastern Sicily population, by developing a binary classifier on a training set of labeled historical data and validating its output on a test set of labeled data created by the operators over the years. Using a machine learning algorithm, we built a classification model that evaluates each free-text report and returns a score indicating the probability that it contains oncologically relevant information. The best-performing algorithm among the eight analyzed in this study was LightGBM, which reached an F1-Score of 98.9%. Using the chosen classifier, we shortened the time needed for case evaluation, improving the timeliness of cancer statistics.
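As an illustration of how a per-report probability score relates to the F1-Score quoted above, the following minimal sketch turns scores into binary decisions and computes F1 from them. The labels, the scores, the 0.5 threshold, and the function names `classify` and `f1_score` are all assumptions made for this example; they are not the study's data, pipeline, or decision rule.

```python
def classify(scores, threshold=0.5):
    """Turn probability scores into binary labels (1 = oncologically relevant).

    The 0.5 threshold is an assumption for illustration only.
    """
    return [1 if s >= threshold else 0 for s in scores]


def f1_score(y_true, y_pred):
    """Compute F1 as the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): operator labels and hypothetical classifier scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.92, 0.10, 0.75, 0.40, 0.05, 0.60, 0.88, 0.20]

y_pred = classify(scores)
print(y_pred)
print(round(f1_score(y_true, y_pred), 3))
```

On this toy data one relevant report is missed and one irrelevant report is flagged, so precision and recall are both 3/4. In a registry setting the threshold would likely be chosen to favor recall, since a missed relevant report costs more than an extra report to review; that trade-off is a design consideration, not a result reported in the abstract.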

Keywords: Binary classification; Cancer registry; Machine learning; Natural language processing.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Humans
  • Information Storage and Retrieval
  • Machine Learning*
  • Natural Language Processing*
  • Retrospective Studies