Ten simple rules for writing Dockerfiles for reproducible data science

Daniel Nüst; Vanessa Sochat; Ben Marwick; Stephen J Eglen; Tim Head; Tony Hirst; Benjamin D Evans

doi:10.1371/journal.pcbi.1008316

PLoS Comput Biol. 2020 Nov 10;16(11):e1008316. doi: 10.1371/journal.pcbi.1008316. eCollection 2020 Nov.

Authors

Daniel Nüst¹, Vanessa Sochat², Ben Marwick³, Stephen J Eglen⁴, Tim Head⁵, Tony Hirst⁶, Benjamin D Evans⁷

Affiliations

¹ Institute for Geoinformatics, University of Münster, Münster, Germany.
² Stanford Research Computing Center, Stanford University, Stanford, California, United States of America.
³ Department of Anthropology, University of Washington, Seattle, Washington, United States of America.
⁴ Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, Cambridgeshire, Great Britain.
⁵ Wild Tree Tech, Zurich, Switzerland.
⁶ Department of Computing and Communications, The Open University, Great Britain.
⁷ School of Psychological Science, University of Bristol, Bristol, Great Britain.

Abstract

Computational science has been greatly improved by the use of containers for packaging software and data dependencies. In a scholarly context, the main drivers for using these containers are transparency and support of reproducibility; in turn, a workflow's reproducibility can be greatly affected by the choices that are made with respect to building containers. In many cases, the container's image is built from instructions provided in a Dockerfile. In support of this approach, we present a set of rules to help researchers write understandable Dockerfiles for typical data science workflows. By following the rules in this article, researchers can create containers suitable for sharing with fellow scientists, for including in scholarly communication such as education or scientific papers, and for effective and sustainable personal workflows.

Publication types

Editorial
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Data Science*
Guidelines as Topic*
Programming Languages*
Reproducibility of Results
Software*

Grants and funding

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. DN is supported by the project Opening Reproducible Research II (https://o2r.info/; https://www.uni-muenster.de/forschungaz/project/12343) funded by the German Research Foundation (DFG) under project number PE 1632/17-1. DN and SJE are supported by a Mozilla mini science grant.