RUBY: Natural Language Processing of French Electronic Medical Records for Breast Cancer Research

Renaud Schiappa; Sara Contu; Dorian Culie; Brice Thamphya; Yann Chateau; Jocelyn Gal; Caroline Bailleux; Juliette Haudebourg; Jean-Marc Ferrero; Emmanuel Barranger; Emmanuel Chamorey

doi:10.1200/CCI.21.00199

JCO Clin Cancer Inform. 2022 Jul:6:e2100199. doi: 10.1200/CCI.21.00199.

Affiliations

¹ Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France.
² Cervico-facial Oncology Surgical Department, University Institute of Face and Neck, University of Côte d'Azur, Nice, France.
³ Department of Medical Oncology, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France.
⁴ Anatomy and Pathological Cytology Laboratory, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France.

Abstract

Purpose: Electronic medical records are a valuable source of information about patients' clinical status, but they are often free-text documents that require laborious manual review before they can be exploited. Techniques from computer science have been investigated, but the literature has paid little attention to non-English texts. We developed RUBY, a tool designed in collaboration with IBM-France to automatically structure clinical information from French medical records of patients with breast cancer.

Materials and methods: RUBY, which combines state-of-the-art Named Entity Recognition models with keyword extraction and postprocessing rules, was applied to clinical texts. We evaluated the precision of RUBY in extracting the target information.
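To make the "keyword extraction and postprocessing rules" part of this description concrete, here is a minimal sketch in Python. It is not RUBY's implementation: the lexicon, the size pattern, the field names, and the example sentence are all hypothetical, and a plain regular expression stands in for the Named Entity Recognition models, which the abstract does not describe in detail.

```python
import re
import unicodedata

# Hypothetical keyword lexicon: French surface form -> normalized value.
LATERALITY = {"droit": "right", "droite": "right", "gauche": "left"}

# Hypothetical pattern for a size such as "1,5 cm" or "12 mm".
SIZE_PATTERN = re.compile(r"(\d+(?:[.,]\d+)?)\s*(mm|cm)\b")


def strip_accents(text):
    """Remove diacritics so that keyword matching ignores accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def extract(text):
    """Return a structured record from one free-text sentence."""
    lowered = strip_accents(text.lower())
    record = {"laterality": None, "tumor_size_mm": None}

    # Keyword extraction: first lexicon hit wins.
    for token in re.findall(r"[a-z]+", lowered):
        if token in LATERALITY:
            record["laterality"] = LATERALITY[token]
            break

    # Postprocessing rule: accept a decimal comma and convert cm to mm.
    match = SIZE_PATTERN.search(lowered)
    if match:
        value = float(match.group(1).replace(",", "."))
        if match.group(2) == "cm":
            value *= 10
        record["tumor_size_mm"] = value
    return record


example = "Carcinome infiltrant du sein gauche, taille tumorale 1,5 cm."
print(extract(example))
```

A real system would also need to handle negation, multiple tumors, and context, which is the kind of deeper semantic analysis the conclusion points to.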

Results: RUBY has an average precision of 92.8% for the Surgery report, 92.7% for the Pathology report, 98.1% for the Biopsy report, and 81.8% for the Consultation report.
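The abstract does not state how the average precision per report type was computed. As a sketch under the assumption that it is an unweighted mean of per-field precision, the calculation could look like this in Python; the field names and counts are toy values for illustration, not data from the study.

```python
def precision(true_positives, false_positives):
    """Precision = correct extractions / all extractions; None if nothing was extracted."""
    extracted = true_positives + false_positives
    if extracted == 0:
        return None
    return true_positives / extracted


def average_precision(counts_by_field):
    """Unweighted mean of per-field precision, skipping fields with no extractions."""
    values = [precision(tp, fp) for tp, fp in counts_by_field.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Toy counts (true positives, false positives) per field; purely illustrative.
toy_counts = {
    "laterality": (95, 5),
    "tumor_size": (45, 5),
    "histology": (0, 0),  # never extracted, so excluded from the mean
}

print(average_precision(toy_counts))
```

Precision measures only how often an extracted value is correct; it says nothing about values the tool failed to extract, which would require recall.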

Conclusion: These results show that the automatic approach has the potential to effectively extract clinical knowledge from an extensive set of electronic medical records, reducing the manual effort required and saving a significant amount of time. Deeper semantic analysis and a better understanding of context in the text, training on a larger and more recent set of reports (including those containing highly variable entities), and the use of ontologies could further improve the results.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Breast Neoplasms* / diagnosis
Breast Neoplasms* / therapy
Electronic Health Records
Female
France
Humans
Natural Language Processing*
Semantics