FoodBase corpus: a new resource of annotated food entities

Gorjan Popovski; Barbara Koroušić Seljak; Tome Eftimov

doi:10.1093/database/baz121

Database (Oxford). 2019 Jan 1;2019:baz121. doi: 10.1093/database/baz121.

Authors

Gorjan Popovski¹ ² ³, Barbara Koroušić Seljak³, Tome Eftimov³ ⁴ ⁵

Affiliations

¹ Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, ul.Rudzer Boshkovikj 16, 1000 Skopje, Macedonia.
² Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia.
³ Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia.
⁴ Department of Biomedical Data Science, Stanford University, 450 Serra Mall, Stanford 94305 CA, USA.
⁵ Center for Population Health Sciences, Stanford University, 450 Serra Mall, Stanford 94305 CA, USA.

Abstract

The existence of annotated text corpora is essential for the development of public health services and tools based on natural language processing (NLP) and text mining. Recently organized biomedical NLP shared tasks have provided annotated corpora related to different biomedical entities such as genes, phenotypes, drugs, diseases and chemical entities. These are needed to develop named-entity recognition (NER) models that are used for extracting entities from text and finding their relations. However, to the best of our knowledge, few annotated corpora provide information about food entities, despite food and dietary management being an essential public health issue. Hence, we developed a new annotated corpus of food entities, named FoodBase. It was constructed using recipes extracted from Allrecipes, which is currently the largest food-focused social network. The recipes were selected from five categories: 'Appetizers and Snacks', 'Breakfast and Lunch', 'Dessert', 'Dinner' and 'Drinks'. Semantic tags used for annotating food entities were selected from the Hansard corpus. To extract and annotate food entities, we applied a rule-based food NER method called FoodIE. Because FoodIE yields only a weakly annotated corpus, we manually evaluated its output on 1000 recipes to create a gold-standard version of FoodBase. The gold standard consists of 12 844 food entity annotations describing 2105 unique food entities. Additionally, we provided a weakly annotated corpus on an additional 21 790 recipes. This weakly annotated set consists of 274 053 food entity annotations, 13 079 of which are unique. The FoodBase corpus is needed for developing corpus-based NER models for food science, and it can serve as a new benchmark dataset for machine learning tasks such as multi-class classification, multi-label classification and hierarchical multi-label classification.
FoodBase can also be used for detecting semantic differences and similarities between food concepts, and we believe that it will open a new path for learning a food embedding space that can be used in predictive studies.
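
The distinction the abstract draws between food entity annotations and unique food entities, and the multi-label view of the semantic tags, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `FoodAnnotation` record, the helper names, the placeholder tag names and the toy data are hypothetical and do not reflect the actual FoodBase file format or the real Hansard tag set. Treating entities as equal when their lowercased strings match is also a simplification, not necessarily the rule used to obtain the published counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoodAnnotation:
    """Hypothetical record for one annotated food entity mention."""
    recipe_id: str
    text: str      # surface form of the food entity as it appears in the recipe
    tags: tuple    # one or more semantic tags (placeholder names, not real tags)


def corpus_stats(annotations):
    """Return (total number of annotations, number of unique entity strings)."""
    unique = {a.text.lower() for a in annotations}
    return len(annotations), len(unique)


def multilabel_targets(annotations):
    """Map each unique entity string to the set of all tags observed with it."""
    targets = {}
    for a in annotations:
        targets.setdefault(a.text.lower(), set()).update(a.tags)
    return targets


# Toy data invented for this sketch; it is not taken from FoodBase.
toy = [
    FoodAnnotation("r1", "olive oil", ("TAG_FAT",)),
    FoodAnnotation("r1", "garlic", ("TAG_VEGETABLE", "TAG_SEASONING")),
    FoodAnnotation("r2", "Olive oil", ("TAG_FAT",)),
]

total, unique = corpus_stats(toy)
targets = multilabel_targets(toy)
```

In this toy case there are three annotations but only two unique entities, mirroring how the gold standard reports far more annotations than unique food entities. An entity carrying more than one tag, such as the toy `garlic` entry, is what makes a multi-label formulation natural.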

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cooking*
Data Curation*
Databases, Factual*
Food*
Humans
Natural Language Processing*