KARAJ: An Efficient Adaptive Multi-Processor Tool to Streamline Genomic and Transcriptomic Sequence Data Acquisition

Mahdieh Labani; Amin Beheshti; Nigel H Lovell; Hamid Alinejad-Rokny; Ali Afrasiabi

doi:10.3390/ijms232214418

Int J Mol Sci. 2022 Nov 20;23(22):14418. doi: 10.3390/ijms232214418.

Authors

Mahdieh Labani¹ ², Amin Beheshti², Nigel H Lovell³ ⁴, Hamid Alinejad-Rokny¹ ⁵ ⁶, Ali Afrasiabi¹ ⁷

Affiliations

¹ Biomedical Machine Learning Lab, The Graduate School of Biomedical Engineering, University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
² Data Analytics Lab, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia.
³ The Graduate School of Biomedical Engineering (GSBmE), University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
⁴ Tyree Institute of Health Engineering (IHealthE), University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
⁵ UNSW Data Science Hub, University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
⁶ Health Data Analytics Program, Centre for Applied Artificial Intelligence, Macquarie University, Sydney, NSW 2109, Australia.
⁷ Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, University of Sydney, Sydney, NSW 2006, Australia.

Abstract

Here we developed KARAJ, a fast and flexible Linux command-line tool that automates the end-to-end process of querying and downloading a wide range of genomic and transcriptomic sequence data types. KARAJ takes as input a list of PMCIDs, publication URLs, or various types of accession numbers and automates four tasks. First, it provides a summary list of the accessible datasets generated by or used in the corresponding scientific articles, enabling users to select appropriate datasets. Second, it calculates the size of the files that users want to download and confirms that adequate space is available on the local disk. Third, it generates a metadata table containing sample information and the experimental design of the corresponding study. Lastly, it enables users to download supplementary data tables attached to publications. In addition, KARAJ provides a parallel downloading framework powered by Aspera Connect, which significantly reduces download time.
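
The disk-space check and parallel download steps described in the abstract can be illustrated with a minimal Python sketch. This is not KARAJ's implementation and does not call Aspera Connect. The function names (`has_enough_space`, `free_bytes_at`, `download_all`), the 5% safety margin, and the caller-supplied `fetch` callable are all assumptions made for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor


def has_enough_space(file_sizes, free_bytes, safety_margin=0.05):
    """Return True if all files fit in free_bytes with a fractional margin to spare.

    file_sizes maps a file name to its size in bytes (hypothetical input shape).
    """
    required = sum(file_sizes.values())
    return required * (1 + safety_margin) <= free_bytes


def free_bytes_at(path="."):
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def download_all(file_sizes, fetch, free_bytes, max_workers=4):
    """Check disk space first, then call fetch(name) for each file in parallel.

    `fetch` is a placeholder for whatever transfer command is actually used;
    it is an assumption of this sketch, not part of KARAJ.
    """
    if not has_enough_space(file_sizes, free_bytes):
        raise OSError("not enough free disk space for the selected files")
    names = list(file_sizes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, names))  # results keep the input order
    return dict(zip(names, results))
```

In use, a caller would pass `free_bytes_at(target_dir)` as `free_bytes` and a `fetch` function that wraps the real transfer. Running the space check before any transfer starts means a shortfall is reported up front rather than partway through a large download.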

Keywords: Bioinformatics; Download; FASTQ; Genomics; Linux; biological data; sequence data; transcriptomics.

MeSH terms

Genome
Genomics
Metadata
Software*
Transcriptome*

Grants and funding

p3432/UNSW Sydney