VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data

Arash Bayat; Piotr Szul; Aidan R O'Brien; Robert Dunne; Brendan Hosking; Yatish Jain; Cameron Hosking; Oscar J Luo; Natalie Twine; Denis C Bauer

doi:10.1093/gigascience/giaa077

Gigascience. 2020 Aug 1;9(8):giaa077. doi: 10.1093/gigascience/giaa077.

Affiliations

¹ Health and Biosecurity, Commonwealth Scientific and Industrial Research Organisation (CSIRO), 11 Julius Ave North Ryde NSW 2113 Australia.
² Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), 5 Garden St Eveleigh NSW 2015 Australia.
³ Department of Systems Biomedical Sciences, School of Medicine, Jinan University, 601 Huangpu Ave, Guangzhou, Guangdong Province, China.
⁴ Department of Biomedical Sciences, Macquarie University NSW 2109 Australia.

Abstract

Background: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS consider only the additive effects of individual genes, not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions.
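
To illustrate the limitation described in the Background, the following minimal Python sketch contrasts an additive score with a purely epistatic trait. It is a toy construction, not taken from the paper: the two variants, the XOR-style phenotype, and the function names (`additive_score`, `epistatic_phenotype`, `marginal_effect`) are all hypothetical. It shows that when a trait depends only on the combination of two variants, each variant on its own can have no marginal effect, so a weighted sum of per-variant effects carries no signal.

```python
from itertools import product


def additive_score(genotype, weights):
    """PRS-style score: weighted sum of per-variant allele indicators."""
    return sum(w * g for w, g in zip(weights, genotype))


def epistatic_phenotype(genotype):
    """Hypothetical trait: affected only when exactly one of two variants is carried."""
    a, b = genotype
    return int((a > 0) != (b > 0))


# Enumerate carrier status (0 = non-carrier, 1 = carrier) at two hypothetical variants.
genotypes = list(product([0, 1], repeat=2))
phenotypes = [epistatic_phenotype(g) for g in genotypes]


def marginal_effect(variant_index):
    """Mean phenotype of carriers minus mean phenotype of non-carriers for one variant."""
    carriers = [p for g, p in zip(genotypes, phenotypes) if g[variant_index] == 1]
    non_carriers = [p for g, p in zip(genotypes, phenotypes) if g[variant_index] == 0]
    return sum(carriers) / len(carriers) - sum(non_carriers) / len(non_carriers)


# Each variant taken alone shows no association with the trait.
marginal_effects = [marginal_effect(0), marginal_effect(1)]

# An additive score built from these per-variant effects is identical for every genotype.
scores = [additive_score(g, marginal_effects) for g in genotypes]
```

In this toy setting every genotype receives the same additive score even though the phenotype differs between genotypes, which is the kind of signal that only an interaction-aware method could recover.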

Findings: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples.
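
As a rough sense of the scale quoted in the Findings, the arithmetic below estimates the size of the genotype matrix. The variant and sample counts come from the abstract; the storage encodings (one byte per genotype call, or two bits per call) are illustrative assumptions only and do not describe how VariantSpark actually stores data.

```python
# Dimensions quoted in the abstract.
variants = 100_000_000
samples = 100_000

# One genotype call per (variant, sample) pair.
genotype_calls = variants * samples

# Assumed encodings, for illustration only.
bytes_at_one_byte_per_call = genotype_calls
bytes_at_two_bits_per_call = genotype_calls * 2 // 8

# Convert to decimal terabytes (10**12 bytes).
terabytes_at_one_byte = bytes_at_one_byte_per_call / 10**12
terabytes_at_two_bits = bytes_at_two_bits_per_call / 10**12
```

Under these assumptions the matrix holds 10^13 genotype calls, on the order of terabytes even with a compact encoding, which is consistent with the abstract's emphasis on distributed, parallel processing.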

Conclusions: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cloud Computing*
Genome-Wide Association Study*
Genomics
Machine Learning
Phenotype