Data pre-processing to improve the mining of large feed databases

F Maroto-Molina; A Gómez-Cabrera; J E Guerrero-Ginel; A Garrido-Varo; D Sauvant; G Tran; V Heuzé; D C Pérez-Marín

doi:10.1017/S1751731113000293

Animal. 2013 Jul;7(7):1128-36. doi: 10.1017/S1751731113000293. Epub 2013 Mar 8.

Authors

F Maroto-Molina¹, A Gómez-Cabrera, J E Guerrero-Ginel, A Garrido-Varo, D Sauvant, G Tran, V Heuzé, D C Pérez-Marín

Affiliation

¹ Servicio de Información sobre Alimentos, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain. g02mamof@uco.es

PMID: 23473337

Abstract

The information stored in animal feed databases is highly variable, in terms of both provenance and quality; therefore, data pre-processing is essential to ensure reliable results. Yet, pre-processing at best tends to be unsystematic; at worst, it may even be wholly ignored. This paper sought to develop a systematic approach to the various stages involved in pre-processing to improve feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques were examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of data contained in this database were explored, the observed outliers were characterized and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the various aspects involved in the detection and management of outliers and errors.
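The contrast between univariate and multivariate outlier detection mentioned in the abstract can be illustrated with a minimal sketch. It does not reproduce the authors' routines, which the abstract does not specify. The sketch assumes z-scores for the univariate check and squared Mahalanobis distance for the multivariate check; the function names, the two variables (labelled protein and fibre), the synthetic values and both thresholds are hypothetical choices made for illustration only.

```python
import numpy as np


def univariate_outliers(X, threshold=3.0):
    """Flag rows where any single variable has |z-score| above threshold."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))
    return (z > threshold).any(axis=1)


def multivariate_outliers(X, cutoff):
    """Flag rows whose squared Mahalanobis distance exceeds cutoff."""
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Row-wise quadratic form: d2[i] = centered[i] @ inv_cov @ centered[i]
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return d2 > cutoff


# Synthetic, hypothetical data: two strongly (negatively) correlated variables.
protein = np.linspace(14.0, 22.0, 41)
noise = 0.2 * (-1.0) ** np.arange(41)      # small deterministic wobble
fibre = 50.0 - protein + noise
X = np.column_stack([protein, fibre])

# One extra record (row index 41) whose two values are each unremarkable on
# their own but together break the relationship seen in the other rows.
X = np.vstack([X, [20.0, 34.0]])

univariate_flags = univariate_outliers(X, threshold=3.0)

# Assumed cutoff: 0.975 quantile of the chi-squared distribution with
# 2 degrees of freedom, i.e. -2 * ln(0.025), roughly 7.378.
multivariate_flags = multivariate_outliers(X, cutoff=7.378)

print("univariate flags:  ", np.flatnonzero(univariate_flags))
print("multivariate flags:", np.flatnonzero(multivariate_flags))
```

In this sketch the added record passes the per-variable check because each value lies well inside its own range, while the multivariate check flags it because the combination is inconsistent with the correlation in the rest of the data. A mean-and-covariance approach like this is itself sensitive to the outliers it is looking for, so robust estimators may be preferable on real data.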

Publication types

Evaluation Study

MeSH terms

Animal Feed*
Animal Husbandry / methods*
Data Interpretation, Statistical
Data Mining / methods*
Databases, Factual*
Medicago sativa