Standardised Versioning of Datasets: a FAIR-compliant Proposal

Alba González-Cebrián; Michael Bradford; Adriana E Chis; Horacio González-Vélez

doi:10.1038/s41597-024-03153-y

Sci Data. 2024 Apr 9;11(1):358. doi: 10.1038/s41597-024-03153-y.

Authors

Alba González-Cebrián¹, Michael Bradford², Adriana E Chis², Horacio González-Vélez²

Affiliations

¹ Cloud Competency Centre, National College of Ireland, Dublin, Ireland. alba.gonzalez-cebrian@ncirl.ie.
² Cloud Competency Centre, National College of Ireland, Dublin, Ireland.

Abstract

This paper presents a standardised dataset versioning framework for improved reusability, recognition and data version tracking, facilitating comparisons and informed decision-making for data usability and workflow integration. The framework adopts a software engineering-like data versioning nomenclature ("major.minor.patch") and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (d_P, d_E,PCA, and d_E,AE) based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the d_E,PCA metric, which combines PCA models with splines. It is computationally efficient, with values below 50 for new dataset batches and values consistent with seasonal or trend variations. Major updates (i.e., values of 100) occur when scaling transformations are applied to over 30% of variables, whereas information loss is handled efficiently, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
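As a rough illustration of the idea, the sketch below computes a PCA reconstruction-error drift score on a 0 to 100 scale and uses it to bump a "major.minor.patch" version string. It is not the authors' d_E,PCA metric: the spline mapping is replaced here by a simple saturating ratio, and the function names (fit_pca, drift_score, bump_version) and the thresholds major_at and minor_at are hypothetical choices made only for this example.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit a PCA model on standardised data; return (mean, std, loadings)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal directions of the standardised data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components].T


def reconstruction_error(X, model):
    """Mean squared residual after projecting X onto the PCA subspace."""
    mean, std, P = model
    Z = (X - mean) / std
    residual = Z - Z @ P @ P.T
    return float(np.mean(residual ** 2))


def drift_score(X_ref, X_new, n_components=2):
    """Illustrative drift score in [0, 100] (assumed mapping, not the paper's).

    The score is 0 when the new batch is reconstructed at least as well as
    the reference data and approaches 100 as the excess error grows.
    """
    model = fit_pca(X_ref, n_components)
    e_ref = reconstruction_error(X_ref, model)
    e_new = reconstruction_error(X_new, model)
    excess = max(e_new - e_ref, 0.0)
    return 100.0 * excess / (excess + e_ref + 1e-12)


def bump_version(version, d, major_at=100.0, minor_at=50.0):
    """Bump 'major.minor.patch' from a drift value d (hypothetical thresholds)."""
    major, minor, patch = (int(part) for part in version.split("."))
    if d >= major_at:
        return f"{major + 1}.0.0"
    if d >= minor_at:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


# Synthetic reference data: five variables driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
weights = rng.normal(size=(2, 5))
X_ref = latent @ weights + 0.1 * rng.normal(size=(200, 5))

# A new batch in which two of the five variables have been rescaled.
X_scaled = X_ref.copy()
X_scaled[:, :2] *= 10.0

d_same = drift_score(X_ref, X_ref)
d_scaled = drift_score(X_ref, X_scaled)
print("drift (unchanged):", d_same)
print("drift (rescaled): ", d_scaled)
print("next version:", bump_version("1.2.3", d_scaled))
```

Rescaling a subset of variables breaks the correlation structure captured by the reference PCA model, so the reconstruction error and hence the score rise sharply, which mirrors the behaviour the abstract reports for scaling transformations. The thresholds would need to be calibrated against the actual metric before any practical use.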

MeSH terms

Datasets as Topic* / standards
Machine Learning
Principal Component Analysis
Reproducibility of Results
Software*
Workflow