Natural language processing and machine learning approaches for food categorization and nutrition quality prediction compared with traditional methods

Am J Clin Nutr. 2023 Mar;117(3):553-563. doi: 10.1016/j.ajcnut.2022.11.022. Epub 2022 Dec 23.

Abstract

Background: Food categorization and nutrient profiling are labor intensive, time consuming, and costly tasks, given the number of products and labels in large food composition databases and the dynamic food supply.

Objectives: This study used a pretrained language model and supervised machine learning to automate food category classification and nutrition quality score prediction based on manually coded and validated data, and compared prediction results with models using bag-of-words and structured nutrition facts as inputs for predictions.

Methods: Food product information from University of Toronto Food Label Information and Price Database 2017 (n = 17,448) and University of Toronto Food Label Information and Price Database 2020 (n = 74,445) databases were used. Health Canada's Table of Reference Amounts (TRA) (24 categories and 172 subcategories) was used for food categorization and the Food Standards of Australia and New Zealand (FSANZ) nutrient profiling system was used for nutrition quality score evaluation. TRA categories and FSANZ scores were manually coded and validated by trained nutrition researchers. A modified pretrained sentence-Bidirectional Encoder Representations from Transformers model was used to encode unstructured text from food labels into lower-dimensional vector representations, followed by supervised machine learning algorithms (i.e., elastic net, k-Nearest Neighbors, and XGBoost) for multiclass classification and regression tasks.

Results: Pretrained language model representations utilized by the XGBoost multiclass classification algorithm reached overall accuracy scores of 0.98 and 0.96 in predicting food TRA major and subcategories, outperforming bag-of-words methods. For FSANZ score prediction, our proposed method reached a similar prediction accuracy (R2: 0.87 and MSE: 14.4) compared with bag-of-words methods (R2: 0.72-0.84; MSE: 30.3-17.6), whereas structured nutrition facts machine learning model performed the best (R2: 0.98; MSE: 2.5). The pretrained language model had a higher generalizable ability on the external test datasets than bag-of-words methods.

Conclusions: Our automation achieved high accuracy in classifying food categories and predicting nutrition quality scores using text information found on food labels. This approach is effective and generalizable in a dynamic food environment, where large amounts of food label data can be obtained from websites.

Keywords: food categorization; food composition database; food label; machine learning; natural language processing (NLP); nutrient profiling; nutrition facts table; nutrition quality; pretrained language model.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Food*
  • Humans
  • Machine Learning
  • Natural Language Processing*
  • Nutritional Status
  • Nutritive Value