Machine learning application for incident prostate adenocarcinomas automatic registration in a French regional cancer registry

Int J Med Inform. 2020 Jul:139:104139. doi: 10.1016/j.ijmedinf.2020.104139. Epub 2020 Apr 9.

Abstract

Cancer registries are collections of curated data about malignant tumor diseases. The amount of data processed by cancer registries increases every year, making manual registration more and more tedious.

Objective: We sought to develop an automatic analysis pipeline that would be able to identify and preprocess registry input for incident prostate adenocarcinomas in a French regional cancer registry.

Methods: Notifications submitted to the Bas-Rhin cancer registry from different sources were used: pathology data, and ICD-10 diagnosis codes from hospital discharge data and healthcare insurance data. We trained a Support Vector Machine model (machine learning) to predict whether a patient's data should be considered an incident case of prostate adenocarcinoma and therefore be registered. The final registration of all identified cases was manually confirmed by a specialized technician. Text mining tools (regular expressions) were used to extract clinical and biological data from unstructured pathology reports.
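The regular-expression step can be illustrated with a minimal Python sketch. It covers only the text-mining part of the methods, not the Support Vector Machine. The abstract does not publish the registry's actual rules, so the patterns, the function name `extract_indicators`, and the sample report text are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, not the registry's actual rules.
# Gleason: the word "Gleason" followed within a few characters by "a+b".
GLEASON_RE = re.compile(r"gleason[^+\n]{0,20}?(\d)\s*\+\s*(\d)", re.IGNORECASE)
# PSA: a number with "." or "," as decimal separator, followed by "ng/ml".
PSA_RE = re.compile(
    r"\bPSA\b\D{0,15}?(\d+(?:[.,]\d+)?)\s*ng\s*/\s*ml", re.IGNORECASE
)


def extract_indicators(report: str) -> dict:
    """Return Gleason grades and PSA value found in a report, or None if absent."""
    result = {"gleason": None, "psa_ng_ml": None}

    match = GLEASON_RE.search(report)
    if match:
        primary, secondary = int(match.group(1)), int(match.group(2))
        # Store (primary grade, secondary grade, total score).
        result["gleason"] = (primary, secondary, primary + secondary)

    match = PSA_RE.search(report)
    if match:
        # Normalize a decimal comma to a decimal point before conversion.
        result["psa_ng_ml"] = float(match.group(1).replace(",", "."))

    return result


# Invented sample text, for illustration only.
sample = "Prostatic adenocarcinoma, Gleason score 7 (3+4). PSA: 6,5 ng/ml."
print(extract_indicators(sample))
```

Returning `None` for a missing field, rather than raising an error, keeps "not mentioned in the report" distinct from "extracted incorrectly".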

Results: We performed two successive analyses. First, we used 982 cases manually labeled by registrars from the 2014 dataset to predict the registration of 785 cases submitted in 2015. Then, we repeated the procedure using the 2089 cases labeled by registrars from the 2014 and 2015 datasets to predict the registration of 926 cases submitted in the 2016 data. The algorithm identified 663 cases of prostate adenocarcinoma in 2015, and 610 in 2016. From these findings, 663 and 531 cases, respectively, were added to the registry, and 641 and 512 cases were confirmed by the specialized technician. This registration process achieved a precision level above 96 %. The algorithm obtained an overall precision of 99 % (99.5 % in 2015 and 98.5 % in 2016) and a recall of 97 % (97.8 % in 2015 and 96.9 % in 2016). When the information was found in the pathology report, text mining achieved more than 90 % accuracy for major indicators (PSA test, Gleason score, and incidence date). For both PSA and tumor side, information was not detected in the majority of cases.
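The "above 96 %" figure can be checked against the reported counts with a short Python sketch. The abstract does not state the formula, so this assumes that the precision of the registration process is the share of added cases that the technician confirmed. The function name `precision` and the `counts` mapping are illustrative.

```python
def precision(confirmed: int, proposed: int) -> float:
    """Fraction of proposed cases that were confirmed."""
    return confirmed / proposed


# (confirmed by the technician, added to the registry), as reported per year.
counts = {2015: (641, 663), 2016: (512, 531)}

for year, (confirmed, added) in counts.items():
    print(f"{year}: {precision(confirmed, added):.1%}")
# 2015: 96.7%
# 2016: 96.4%
```

Under this reading, both years exceed 96 %, which matches the reported precision level.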

Conclusion: Machine learning was able to identify new cases of prostate cancer, and text mining was able to prefill the data about incident cases. Machine-learning-based automation of the registration process could reduce delays in data production and allow investigators to devote more time to complex tasks and analysis.

Keywords: Cancer registry; Machine learning; Prostate adenocarcinoma.

MeSH terms

  • Adenocarcinoma / epidemiology*
  • Adenocarcinoma / pathology*
  • Algorithms*
  • Data Mining / methods
  • France / epidemiology
  • Humans
  • Incidence
  • International Classification of Diseases
  • Machine Learning*
  • Male
  • Prostatic Neoplasms / epidemiology*
  • Prostatic Neoplasms / pathology*
  • Registries / statistics & numerical data*