CafeteriaSA corpus: scientific abstracts annotated across different food semantic resources

Gjorgjina Cenikj; Eva Valenčič; Gordana Ispirova; Matevž Ogrinc; Riste Stojanov; Peter Korošec; Ermanno Cavalli; Barbara Koroušić Seljak; Tome Eftimov

doi:10.1093/database/baac107

Database (Oxford). 2022 Dec 16:2022:baac107. doi: 10.1093/database/baac107.

Authors

Gjorgjina Cenikj¹ ², Eva Valenčič¹ ² ³ ⁴, Gordana Ispirova¹ ², Matevž Ogrinc¹ ², Riste Stojanov⁵, Peter Korošec¹, Ermanno Cavalli⁶, Barbara Koroušić Seljak¹ ², Tome Eftimov¹

Affiliations

¹ Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia.
² Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia.
³ School of Health Sciences, College of Health, Medicine and Wellbeing, University of Newcastle, University Drive, Callaghan Campus, Newcastle, NSW 2308, Australia.
⁴ Food and Nutrition Program, Hunter Medical Research Institute, Lot 1 Kookaburra Circuit, New Lambton Heights, Newcastle, NSW 2305, Australia.
⁵ Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Ruger Boshkovikj 16, Skopje 1000, North Macedonia.
⁶ European Food Safety Authority, Via Carlo Magno 1A, Parma 43126, Italy.

Abstract

In recent decades, a great amount of work has been done on predictive modeling of issues related to human and environmental health. Progress on healthcare-related issues is made possible by several biomedical vocabularies and standards, which play a crucial role in understanding health information, together with a large amount of health-related data. However, despite the large number of resources available and the work done in the health and environmental domains, the food and nutrition domain lacks semantic resources, as well as interconnections between them. To address this gap, in the European Food Safety Authority-funded project CAFETERIA, we have developed the first annotated corpus of 500 scientific abstracts, which contains 6407 food entities annotated with respect to the Hansard taxonomy, 4299 with respect to FoodOn and 3623 with respect to SNOMED-CT. The CafeteriaSA corpus will enable the further development of natural language processing methods for extracting food information from scientific text. Database URL: https://zenodo.org/record/6683798#.Y49wIezMJJF.

MeSH terms

Databases, Factual
Humans
Information Storage and Retrieval
Natural Language Processing*
Semantics*