Eleven quick tips for data cleaning and feature engineering

Davide Chicco; Luca Oneto; Erica Tavazzi

doi:10.1371/journal.pcbi.1010718

PLoS Comput Biol. 2022 Dec 15;18(12):e1010718. doi: 10.1371/journal.pcbi.1010718. eCollection 2022 Dec.

Authors

Davide Chicco¹, Luca Oneto²,³, Erica Tavazzi⁴

Affiliations

¹ Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.
² Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy.
³ ZenaByte S.r.l., Genoa, Italy.
⁴ Dipartimento di Ingegneria dell'Informazione, Università di Padova, Padua, Italy.

Abstract

Applying computational statistics or machine learning methods to data is a key component of many scientific studies in any field, but on its own it might not be sufficient to generate robust and reliable results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call a "feature" a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Although pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips on how to carry out these important preprocessing steps correctly, avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore address these guidelines to any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.

Copyright: © 2022 Chicco et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Computational Biology* / methods
Engineering
Humans
Machine Learning*

Grants and funding

The authors received no specific funding for this work.