Evaluating missing value imputation methods for food composition databases

Gordana Ispirova; Tome Eftimov; Barbara Koroušić Seljak

doi:10.1016/j.fct.2020.111368

Food Chem Toxicol. 2020 Jul:141:111368. doi: 10.1016/j.fct.2020.111368. Epub 2020 May 5.

Authors

Gordana Ispirova¹, Tome Eftimov², Barbara Koroušić Seljak³

Affiliations

¹ Computer Systems Department, Jožef Stefan Institute, Jamova Cesta 39, 1000, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Jamova Cesta 39, 1000, Ljubljana, Slovenia. Electronic address: gordana.ispirova@ijs.si.
² Computer Systems Department, Jožef Stefan Institute, Jamova Cesta 39, 1000, Ljubljana, Slovenia.
³ Computer Systems Department, Jožef Stefan Institute, Jamova Cesta 39, 1000, Ljubljana, Slovenia; School of Engineering and Management, University of Nova Gorica, Vipavska 13, 5000, Nova Gorica, Slovenia.

PMID: 32380076
DOI: 10.1016/j.fct.2020.111368

Abstract

Missing data are a common problem in most research fields and introduce an element of ambiguity into data analysis. They can arise for different reasons: mishandling of samples, measurement error, deletion of aberrant values, or simply a lack of analysis. The nutrition domain is no exception to the problem of missing data. This paper addresses the problem of missing data in food composition databases (FCDBs). Missing data leave FCDBs incomplete, which limits their usage, because a dietary assessment can be performed only on a complete dataset. Most often, this problem is resolved by calculating means/medians from existing data in the same database or borrowing data from other FCDBs. These solutions introduce significant error. We focus on missing data imputation techniques that substitute missing values with statistical predictions: Non-Negative Matrix Factorization (NMF), Multiple Imputations by Chained Equations (MICE), Nonparametric Missing Value Imputation using Random Forest (MissForest), and K-Nearest Neighbors (KNN), and compare them with commonly used approaches: fill-in with the mean and fill-in with the median. The data used were from national FCDBs collected by EuroFIR (European Food Information Resource Network). The results show that the state-of-the-art imputation methods yield better results than the traditional approaches.
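The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use EuroFIR data: the nutrient table is synthetic, the helper names (`column_fill`, `knn_fill`, `rmse`) are hypothetical, and only mean, median, and a simple KNN fill are shown (NMF, MICE, and MissForest are omitted). Known values are hidden, imputed, and scored by root-mean-square error against the hidden truth.

```python
import math
import random
import statistics


def column_fill(rows, stat):
    """Replace None in each column with stat() of that column's observed values."""
    cols = list(zip(*rows))
    fills = [stat([v for v in col if v is not None]) for col in cols]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def knn_fill(rows, k=3):
    """Replace None with the column mean over the k nearest donor rows.

    Distance is computed on columns observed in both rows. Columns are not
    rescaled here, which is a simplification of this sketch.
    """
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                donors = sorted(
                    (dist(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )[:k]
                filled[i][j] = statistics.fmean([val for _, val in donors])
    return filled


# Synthetic "food composition" table (assumption): energy, fat, protein, carbohydrate.
rng = random.Random(0)
truth = []
for _ in range(40):
    fat, protein, carbs = rng.uniform(0, 30), rng.uniform(0, 25), rng.uniform(0, 60)
    energy = 9 * fat + 4 * protein + 4 * carbs  # Atwater general factors, kcal
    truth.append([energy, fat, protein, carbs])

# Hide one randomly chosen cell in every fourth row.
masked = [list(r) for r in truth]
hidden = []
for i in range(0, len(masked), 4):
    j = rng.randrange(len(masked[i]))
    masked[i][j] = None
    hidden.append((i, j))


def rmse(filled):
    """Root-mean-square error over the hidden cells only."""
    return math.sqrt(sum((filled[i][j] - truth[i][j]) ** 2 for i, j in hidden) / len(hidden))


scores = {
    "mean": rmse(column_fill(masked, statistics.fmean)),
    "median": rmse(column_fill(masked, statistics.median)),
    "knn": rmse(knn_fill(masked, k=3)),
}
for name, score in scores.items():
    print(f"{name}: RMSE = {score:.2f}")
```

The scores from this toy table say nothing about real FCDBs; the sketch only shows the evaluation pattern of masking known values, imputing them, and comparing against the truth.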

Keywords: Food composition databases; Missing data; Missing-data imputation; Nutrient values; Food composition data.

MeSH terms

Algorithms
Data Interpretation, Statistical*
Database Management Systems*
Food Analysis*
Nutritive Value