Transformers for extracting breast cancer information from Spanish clinical narratives

Oswaldo Solarte-Pabón; Orlando Montenegro; Alvaro García-Barragán; Maria Torrente; Mariano Provencio; Ernestina Menasalvas; Víctor Robles

doi:10.1016/j.artmed.2023.102625

Artif Intell Med. 2023 Sep:143:102625. doi: 10.1016/j.artmed.2023.102625. Epub 2023 Jul 13.

Authors

Oswaldo Solarte-Pabón¹, Orlando Montenegro², Alvaro García-Barragán³, Maria Torrente⁴, Mariano Provencio⁴, Ernestina Menasalvas³, Víctor Robles³

Affiliations

¹ Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain; Escuela de Ingeniería de Sistemas, Universidad del Valle, Cali, Colombia. Electronic address: oswaldo.solartep@alumnos.upm.es.
² Escuela de Ingeniería de Sistemas, Universidad del Valle, Cali, Colombia.
³ Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain.
⁴ Hospital Universitario Puerta de Hierro de Madrid, Madrid, Spain.

PMID: 37673566
DOI: 10.1016/j.artmed.2023.102625

Abstract

The wide adoption of electronic health records (EHRs) offers immense potential as a source of support for clinical research. However, previous studies on information extraction in the cancer domain for the Spanish language have covered only a limited set of medical concepts. Building on the success of deep learning for processing natural language texts, this paper proposes a transformer-based approach to extract named entities from breast cancer clinical notes written in Spanish and compares several language models. To support this approach, a schema for annotating clinical notes with breast cancer concepts is presented, and a breast cancer corpus is developed. Results indicate that both BERT-based and RoBERTa-based language models perform competitively in clinical Named Entity Recognition (NER). Specifically, BETO and multilingual BERT achieve F-scores of 93.71% and 94.63%, respectively, while RoBERTa Biomedical attains an F-score of 95.01% and RoBERTa BNE an F-score of 94.54%. The findings suggest that transformers can feasibly extract information in the clinical domain in the Spanish language, and that models trained on biomedical texts contribute to better results. The proposed approach takes advantage of transfer learning: fine-tuning language models lets them represent text features automatically and avoids the time-consuming feature engineering process.
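The F-scores reported above summarize NER quality as the harmonic mean of precision and recall. The abstract does not state the exact evaluation protocol, so the following is only a minimal sketch under the assumption of entity-level, exact-match scoring over BIO-tagged tokens. The label names `CANCER_CONCEPT` and `TNM` and the helper names `bio_to_spans` and `entity_f1` are hypothetical and are not taken from the paper's annotation schema.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one sentence of BIO tags into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            # A stray I- tag with no matching B- is treated as a new entity.
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 with exact span and type match."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for illustration only (assumption, not the paper's schema).
gold = [["B-CANCER_CONCEPT", "I-CANCER_CONCEPT", "O", "B-TNM"]]
pred = [["B-CANCER_CONCEPT", "I-CANCER_CONCEPT", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.4f}")
```

In this toy case the prediction finds one of the two gold entities and makes no false predictions, so precision is 1.0 and recall is 0.5. Exact-match scoring is the stricter choice; a relaxed (overlap-based) variant would credit partially correct spans and typically gives higher numbers.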

Keywords: Breast cancer; Clinical narratives; Deep learning; Named Entity Recognition (NER); Natural Language Processing (NLP).

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Breast Neoplasms*
Deep Learning
Electronic Health Records*
Information Storage and Retrieval
Multilingualism
Natural Language Processing