An open source corpus and automatic tool for section identification in Spanish health records

Iker de la Iglesia; María Vivó; Paula Chocrón; Gabriel de Maeztu; Koldo Gojenola; Aitziber Atutxa

doi:10.1016/j.jbi.2023.104461

J Biomed Inform. 2023 Sep:145:104461. doi: 10.1016/j.jbi.2023.104461. Epub 2023 Aug 2.

Authors

Iker de la Iglesia¹, María Vivó², Paula Chocrón³, Gabriel de Maeztu⁴, Koldo Gojenola⁵, Aitziber Atutxa⁶

Affiliations

¹ HiTZ Basque Center for Language Technology, Faculty of Engineering Bilbao, University of the Basque Country (UPV/EHU), Spain. Electronic address: iker.delaiglesia@ehu.eus.
² IOMED Medical Solutions SL, Barcelona, Spain. Electronic address: maria.vivo@iomed.es.
³ IOMED Medical Solutions SL, Barcelona, Spain. Electronic address: paula.chocron@iomed.es.
⁴ IOMED Medical Solutions SL, Barcelona, Spain. Electronic address: gabriel.maeztu@iomed.es.
⁵ HiTZ Basque Center for Language Technology, Faculty of Engineering Bilbao, University of the Basque Country (UPV/EHU), Spain. Electronic address: koldo.gojenola@ehu.eus.
⁶ HiTZ Basque Center for Language Technology, Faculty of Engineering Bilbao, University of the Basque Country (UPV/EHU), Spain. Electronic address: aitziber.atucha@ehu.eus.

PMID: 37536643
DOI: 10.1016/j.jbi.2023.104461

Abstract

Background: Electronic Clinical Narratives (ECNs) store valuable information about individuals' health. However, few open-source datasets are available. Moreover, ECNs can be structurally heterogeneous, ranging from documents with explicit section headings or titles to unstructured notes. This lack of structure complicates both building automatic systems and evaluating them.

Objective: This work aims to provide the scientific community with an open-source Spanish dataset for building and evaluating automatic section identification systems. Alongside the dataset, we aim to design and implement a suitable evaluation measure and a fine-tuned language model adapted to the task.

Materials and methods: A corpus of unstructured clinical records, in this case progress notes written in Spanish, was annotated with seven major section types. Existing metrics for this task were thoroughly assessed and, building on the most suitable one, we defined a new metric, B2, better tailored to the task.

Results: The annotated corpus, the new evaluation script, and a baseline model are freely available to the community. The model reaches an average B2 score of 71.3 on our open-source dataset and an average B2 score of 67.0 in data-scarcity scenarios where the target corpus and its structure differ from the dataset used to train the language model (LM).

Conclusion: Although section identification in unstructured clinical narratives is challenging, this work shows that competitive automatic systems can be built when both data and the right evaluation metrics are available. The annotated data, the evaluation scripts, and the section identification language model are open-sourced in the hope that this contribution will foster the development of more and better systems.

Keywords: Deep learning; Language models; Natural language processing; Section identification; Unstructured clinical documents.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Electronic Health Records*
Language*
Natural Language Processing