Deep convolutional and conditional neural networks for large-scale genomic data generation

Burak Yelmen; Aurélien Decelle; Leila Lea Boulos; Antoine Szatkownik; Cyril Furtlehner; Guillaume Charpiat; Flora Jay

doi:10.1371/journal.pcbi.1011584

PLoS Comput Biol. 2023 Oct 30;19(10):e1011584. doi: 10.1371/journal.pcbi.1011584. eCollection 2023 Oct.

Authors

Burak Yelmen¹,², Aurélien Decelle¹,³, Leila Lea Boulos¹,⁴, Antoine Szatkownik¹, Cyril Furtlehner¹, Guillaume Charpiat¹, Flora Jay¹

Affiliations

¹ Université Paris-Saclay, CNRS, INRIA, LISN, Paris, France.
² University of Tartu, Institute of Genomics, Tartu, Estonia.
³ Universidad Complutense de Madrid, Departamento de Física Teórica, Madrid, Spain.
⁴ Université d'Évry Val-d'Essonne, Évry-Courcouronnes, France.

Abstract

Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) that preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data vastly increases computational complexity. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with a large number of SNPs. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data.

As genetic privacy grows in importance, so does the need for effective privacy protection measures for genomic data. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.
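As an illustration of one characteristic named in the abstract, linkage disequilibrium, the sketch below computes the standard pairwise r² statistic on a small binary haplotype matrix and compares a real and an artificial panel. It is a minimal example under stated assumptions and not the authors' evaluation pipeline: the function names (allele_frequencies, ld_r2, ld_matrix_distance), the 0/1 encoding of alleles, and the mean-absolute-difference summary are all hypothetical choices made for illustration.

```python
from itertools import combinations


def allele_frequencies(haplotypes):
    """Frequency of the 1-allele at each SNP (column) of a 0/1 haplotype matrix."""
    n = len(haplotypes)
    return [sum(column) / n for column in zip(*haplotypes)]


def ld_r2(haplotypes, i, j):
    """Squared correlation r^2 between SNPs i and j.

    Returns 0.0 when either SNP is monomorphic, where r^2 is undefined
    (an illustrative convention, not taken from the paper).
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0
    d = p_ij - p_i * p_j  # coefficient of linkage disequilibrium D
    return d * d / denom


def ld_matrix_distance(real, artificial):
    """Mean absolute difference in r^2 over all SNP pairs (hypothetical summary)."""
    n_snps = len(real[0])
    pairs = list(combinations(range(n_snps), 2))
    total = sum(abs(ld_r2(real, i, j) - ld_r2(artificial, i, j)) for i, j in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy panels: rows are haplotypes, columns are SNPs.
    real = [[0, 0], [0, 0], [1, 1], [1, 1]]        # two SNPs in perfect LD
    artificial = [[0, 0], [0, 1], [1, 0], [1, 1]]  # two independent SNPs
    print(allele_frequencies(real))
    print(ld_r2(real, 0, 1), ld_r2(artificial, 0, 1))
    print(ld_matrix_distance(real, artificial))
```

A smaller distance would indicate that the artificial panel reproduces the pairwise correlation structure of the real one more closely; a real analysis would use many more haplotypes and SNPs.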

Copyright: © 2023 Yelmen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Databases, Factual
Genomics*
Haplotypes
Learning*
Neural Networks, Computer

Grants and funding

This work was funded by the Agence Nationale de la Recherche through grant ANR-20-CE45-0010-01 RoDAPoG (B.Y., L.B., A.S., C.F., G.C., F.J.); the Comunidad de Madrid and the Complutense University of Madrid (Spain) through the Atracción de Talento programs (Refs. 2019-T1/TIC-13298), the Banco Santander and the UCM (grant PR44/21-29937), and the Ministerio de Economía y Competitividad, Agencia Estatal de Investigación and Fondo Europeo de Desarrollo Regional (FEDER) (Spain and European Union) through grant PID2021-125506NA-I00 (A.D.); Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR as part of the program "Investissement d’Avenir" Idex Paris-Saclay (ANR-11-IDEX-0003-02) (L.B.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.