Defining a Preprocessing Pipeline for the MULTI-SITA Project and General Medical Italian Natural Language Data

Alice Cappello; Sara Mora; Daniele Roberto Giacobbe; Matteo Bassetti; Mauro Giacomini; SITA (Italian Society of Anti-Infective Therapy)

doi:10.3233/SHTI230737

Stud Health Technol Inform. 2023 Oct 20:309:48-52. doi: 10.3233/SHTI230737.

Authors

Alice Cappello¹, Sara Mora², Daniele Roberto Giacobbe¹,³, Matteo Bassetti¹,³, Mauro Giacomini²; SITA (Italian Society of Anti-Infective Therapy)

Affiliations

¹ Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy.
² Department of Informatics, Bioengineering, Robotics and System Engineering, University of Genoa, Genoa, Italy.
³ Department of Health Sciences, University of Genoa, Genoa, Italy.

PMID: 37869804
DOI: 10.3233/SHTI230737

Abstract

The application of Natural Language Processing (NLP) to medical data has revolutionized different aspects of health care. The benefits of this technique extend to several areas, including the implementation of chatbots, which can provide medical assistance remotely. Every application of NLP depends on one main first step: the preprocessing of the retrieved corpus. The raw data must be prepared so that they can be used efficiently in further analysis. Considerable progress has been made in this direction for the English language, but for other languages, such as Italian, the state of the art is not as advanced, especially for texts containing technical medical terms. The aim of this work is to identify and develop a preprocessing pipeline suitable for medical data written in Italian. The pipeline was developed in a Python environment, employing the Enchant and nltk modules and Hugging Face's BERT- and BART-based models. It was then tested on real typed conversations between patients and physicians regarding medical questions. The algorithm was developed within the MULTI-SITA project of the Italian Society of Anti-Infective Therapy (SITA), but it has a flexible structure that can adapt to a large variety of data.
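
The abstract does not list the individual pipeline stages, so the following is only a minimal sketch of what a preprocessing chain for typed Italian medical text might look like, not the authors' implementation. It uses the Python standard library alone: the small `STOPWORDS` set stands in for a stop-word list such as the one nltk provides, the hypothetical `VOCABULARY` set with a `difflib` lookup stands in for a dictionary-based spell checker such as Enchant, and the transformer-based stage is omitted. All names and word lists are illustrative assumptions.

```python
import re
from difflib import get_close_matches

# Tiny illustrative Italian stop-word list (assumption; a real list is much longer).
STOPWORDS = {"il", "lo", "la", "i", "gli", "le", "un", "una",
             "di", "da", "ho", "e", "che", "per", "con"}

# Hypothetical vocabulary standing in for a spell-checker dictionary
# extended with medical terms.
VOCABULARY = {"febbre", "antibiotico", "tosse", "giorni",
              "dolore", "prendere", "devo", "tre"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (accented letters kept)."""
    return re.findall(r"[a-zàèéìíòóùú]+", text.lower())


def correct(token, vocabulary=VOCABULARY):
    """Return the closest vocabulary entry, or the token unchanged if none is close."""
    if token in vocabulary:
        return token
    matches = get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else token


def preprocess(text):
    """Tokenize, drop stop words, then spell-correct the remaining tokens."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return [correct(t) for t in tokens]


print(preprocess("Ho la febre da tre giorni, devo prendere un antibiotco?"))
```

In this sketch, misspelled tokens such as "febre" are mapped to the nearest vocabulary entry, while tokens with no close match are passed through unchanged, so unknown technical terms are not silently replaced.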

Keywords: BART; BERT; Chatbots; Italian Language; MULTI-SITA; Machine Learning; Medical Data; Natural Language Processing.

MeSH terms

Algorithms*
Humans
Italy
Language*
Natural Language Processing
Writing