Animal disease surveillance: How to represent textual data for classifying epidemiological information

Sarah Valentin; Rémy Decoupes; Renaud Lancelot; Mathieu Roche

doi:10.1016/j.prevetmed.2023.105932

Prev Vet Med. 2023 Jul:216:105932. doi: 10.1016/j.prevetmed.2023.105932. Epub 2023 May 12.

Authors

Sarah Valentin¹, Rémy Decoupes², Renaud Lancelot³, Mathieu Roche⁴

Affiliations

¹ CIRAD, F-34398 Montpellier, France; ASTRE, Univ Montpellier, CIRAD, INRAE, Montpellier, France; TETIS, Univ Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France; Département de Biologie, Université de Sherbrooke, Sherbrooke, Québec, Canada.
² TETIS, Univ Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France.
³ CIRAD, F-34398 Montpellier, France; ASTRE, Univ Montpellier, CIRAD, INRAE, Montpellier, France.
⁴ CIRAD, F-34398 Montpellier, France; TETIS, Univ Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France. Electronic address: mathieu.roche@cirad.fr.

PMID: 37247579
DOI: 10.1016/j.prevetmed.2023.105932

Abstract

Informal sources are recognized as valuable for early warning and preparedness: they improve the timeliness of disease outbreak detection and provide detailed epidemiological information. This study evaluates machine learning methods for classifying information from animal disease-related news at a fine-grained level (i.e., by epidemiological topic). We compare two textual representations: bag-of-words and a distributional approach based on word embeddings. Both representations performed well for binary relevance classification (F-measures of 0.839 and 0.871, respectively). For classifying sentences into fine-grained epidemiological topics, the word embedding representation outperformed the bag-of-words representation (F-measure of 0.745). Our results suggest that word embeddings are of interest for low-frequency classes in a specialized domain. However, this representation did not bring significant performance improvements for binary relevance classification, indicating that the textual representation should be adapted to each classification task.
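As an illustration of the two representations compared above, the following minimal sketch builds a bag-of-words count vector and an averaged word embedding vector for two toy sentences, and computes an F-measure from precision and recall. It is not the study's pipeline: the function names, the two-dimensional `TOY_EMBEDDINGS` table, and the example sentences are hypothetical, and the abstract does not specify the embedding model or classifier that was used.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return a count vector with one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def f_measure(true_labels, predicted_labels, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 2-dimensional embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "outbreak": [0.9, 0.1],
    "epidemic": [0.8, 0.2],
    "cattle": [0.1, 0.9],
    "cows": [0.2, 0.8],
}

sentence_a = ["outbreak", "in", "cattle"]
sentence_b = ["epidemic", "in", "cows"]
vocabulary = sorted(set(sentence_a) | set(sentence_b))

bow_a = bag_of_words(sentence_a, vocabulary)
bow_b = bag_of_words(sentence_b, vocabulary)
emb_a = mean_embedding(sentence_a, TOY_EMBEDDINGS, dim=2)
emb_b = mean_embedding(sentence_b, TOY_EMBEDDINGS, dim=2)

# Toy gold labels and predictions for a binary relevance task.
score = f_measure([1, 1, 0, 0], [1, 0, 1, 0])
```

In this toy case the two sentences share only the word "in" in the bag-of-words space, while their averaged embeddings nearly coincide because the synonyms were given nearby vectors. This is one intuition for why a distributional representation may help with low-frequency classes, where exact word overlap between training and test sentences is rare.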

Keywords: Animal disease surveillance; Classification; Word embedding.

MeSH terms

Animal Diseases* / epidemiology
Animals
Machine Learning*