Defining a Preprocessing Pipeline for the MULTI-SITA Project and General Medical Italian Natural Language Data

Stud Health Technol Inform. 2023 Oct 20:309:48-52. doi: 10.3233/SHTI230737.

Abstract

The application of Natural Language Processing (NLP) to medical data has transformed several aspects of health care. Its benefits extend to many areas, including the implementation of chatbots that can provide medical assistance remotely. Every application of NLP depends on one essential first step: the preprocessing of the retrieved corpus. The raw data must be prepared so that it can be used efficiently in further analysis. Considerable progress has been made in this direction for the English language, but for other languages, such as Italian, the state of the art is not as advanced, especially for texts containing technical medical terms. The aim of this work is to identify and develop a preprocessing pipeline suitable for medical data written in Italian. The pipeline was developed in a Python environment, employing the Enchant and NLTK modules and BERT- and BART-based models from Hugging Face. It was then tested on real typed conversations between patients and physicians regarding medical questions. The algorithm was developed within the MULTI-SITA project of the Italian Society of Anti-Infective Therapy (SITA), but it has a flexible structure that can adapt to a large variety of data.
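
The abstract does not list the individual steps of the pipeline, so the following is only a minimal sketch of what a preprocessing chain of this general kind might look like, not the authors' implementation. It uses the Python standard library alone: `difflib` stands in for a dictionary-based spell checker such as Enchant, a regular expression stands in for an NLTK tokenizer, and the BERT- and BART-based steps are not represented at all. The names `STOPWORDS`, `VOCABULARY`, `normalize`, `tokenize`, `correct` and `preprocess`, as well as the word lists and the similarity cutoff, are hypothetical and chosen for illustration.

```python
import re
import unicodedata
from difflib import get_close_matches

# Hypothetical, deliberately tiny resources; a real pipeline would rely on a
# full Italian dictionary and a curated medical vocabulary.
STOPWORDS = {"il", "la", "di", "e", "ho", "da", "un", "una", "per", "che"}
VOCABULARY = ["antibiotico", "devo", "febbre", "giorni", "prendere", "tosse", "tre"]


def normalize(text):
    """Lowercase the text, normalize Unicode, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split into alphabetic tokens, keeping Italian accented vowels."""
    return re.findall(r"[a-zàèéìíòóùú]+", text)


def correct(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace a token with its closest vocabulary entry, if one is close enough.

    Assumption: fuzzy string matching is used here as a simple stand-in for
    a dictionary-based spell checker.
    """
    if token in vocabulary or token in STOPWORDS:
        return token
    matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def preprocess(text):
    """Run normalization, tokenization, spelling correction, stopword removal."""
    tokens = [correct(t) for t in tokenize(normalize(text))]
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    message = "Ho la febre da tre giorni, devo prendere un antibiotco?"
    print(preprocess(message))
```

In this sketch, misspelled tokens such as "febre" and "antibiotco" are mapped to the nearest vocabulary entry before stopwords are dropped. Tokens with no sufficiently similar vocabulary entry are left unchanged, which avoids silently rewriting technical terms that the small word list does not cover.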

Keywords: BART; BERT; Chatbots; Italian Language; MULTI-SITA; Machine Learning; Medical Data; Natural Language Processing.

MeSH terms

  • Algorithms*
  • Humans
  • Italy
  • Language*
  • Natural Language Processing
  • Writing