aws-s3-integrity-check: an open-source bash tool to verify the integrity of a dataset stored on Amazon S3

GigaByte. 2023 Aug 23:2023:gigabyte87. doi: 10.46471/gigabyte.87. eCollection 2023.

Abstract

Amazon Simple Storage Service (Amazon S3) is a widely used platform for storing large biomedical datasets. Unintended data alterations can occur during data writing and transmission, altering the original content and generating unexpected results. However, no open-source and easy-to-use tool exists to verify end-to-end data integrity. Here, we present aws-s3-integrity-check, a user-friendly, lightweight, and reliable bash tool to verify the integrity of a dataset stored in an Amazon S3 bucket. Using this tool, we only needed ∼114 min to verify the integrity of 1,045 records ranging between 5 bytes and 10 gigabytes and occupying ∼935 gigabytes of the Amazon S3 cloud. Our aws-s3-integrity-check tool also provides file-by-file on-screen and log-file-based information about the status of each integrity check. To our knowledge, this tool is the only open-source one that allows verifying the integrity of a dataset uploaded to the Amazon S3 Storage quickly, reliably, and efficiently. The tool is freely available for download and use at https://github.com/SoniaRuiz/aws-s3-integrity-check and https://hub.docker.com/r/soniaruiz/aws-s3-integrity-check.

Grants and funding

SGR, RHR, MG-P, JWB and MR were supported through the award of a Tenure Track Clinician Scientist Fellowship to MR (MR/N008324/1). EKG was supported by the Postdoctoral Fellowship Program in Alzheimer’s Disease Research from the BrightFocus Foundation (Award Number: A2021009F). ZC was supported by a clinical research fellowship from the Leonard Wolfson Foundation. AF-B was supported through the award of a Biotechnology and Biological Sciences Research Council (BBSRC UK) London Interdisciplinary Doctoral Fellowship.