Generative Adversarial Networks for Creating Synthetic Nucleic Acid Sequences of Cat Genome

Int J Mol Sci. 2022 Mar 28;23(7):3701. doi: 10.3390/ijms23073701.

Abstract

Nucleotides are the basic units of deoxyribonucleic acid (DNA), and every organism has a distinct DNA sequence made up of specific nucleotides. Sequencing reveals the genetic information carried by a particular DNA segment. Nucleic acid sequencing also exposes evolutionary changes among organisms and has revolutionized disease diagnosis in animals. This paper proposes a generative adversarial network (GAN) model to create synthetic nucleic acid sequences of the cat genome tuned to exhibit specific desired properties. We obtained the raw sequence data from Illumina next-generation sequencing and preprocessed it with the Cutadapt and DADA2 tools. The processed data were fed to the GAN model, which follows the architecture of the Wasserstein GAN with gradient penalty (WGAN-GP). We introduced a predictor and an evaluator into the proposed GAN model to tune the synthetic sequences toward certain realistic properties. The predictor extracts samples that contain a promoter sequence, and the evaluator filters for samples that score high on motif-matching. The filtered samples are then passed to the discriminator. We evaluated the model on multiple metrics and demonstrated outputs for latent interpolation, latent complementation, and motif-matching. The evaluation showed that the proposed GAN model achieved 93.7% correlation with the original data and compared favorably with existing models for sequence generation.
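As a rough illustration of the motif-matching filter described above, the following minimal sketch scores each generated sequence against a consensus motif and keeps only the high-scoring ones. The abstract does not specify the scoring scheme, so everything here is an assumption made for illustration: the TATA-box-like motif `TATAAA`, the fraction-of-matching-positions score, the 0.8 threshold, and the function names `motif_score` and `filter_by_motif`. It is not the authors' evaluator.

```python
import random

# Hypothetical consensus motif, chosen only for illustration.
TATA_MOTIF = "TATAAA"


def motif_score(seq, motif=TATA_MOTIF):
    """Return the best fraction of matching positions over all windows of seq."""
    k = len(motif)
    if len(seq) < k:
        return 0.0
    best = 0
    for i in range(len(seq) - k + 1):
        # Count positions where the window agrees with the motif.
        matches = sum(a == b for a, b in zip(seq[i:i + k], motif))
        best = max(best, matches)
    return best / k


def filter_by_motif(samples, threshold=0.8, motif=TATA_MOTIF):
    """Keep samples whose motif score meets the (assumed) threshold."""
    return [s for s in samples if motif_score(s, motif) >= threshold]


def random_sequence(length, rng):
    """Stand-in for a generator output: a uniformly random DNA string."""
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(0)
    generated = [random_sequence(50, rng) for _ in range(100)]
    kept = filter_by_motif(generated)
    print(f"{len(kept)} of {len(generated)} samples passed the motif filter")
```

In a setup like the one the abstract outlines, only the sequences returned by a filter of this kind would be forwarded to the discriminator. A position weight matrix with log-odds scores would be a more common choice than exact position matching, and could replace `motif_score` without changing the filtering step.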

Keywords: WGAN-GP; cat genome; generative adversarial networks; motif matching; nucleic acid sequences; promoter classification; promoter prediction; synthetic genome.

MeSH terms

  • Adenosine Deaminase*
  • DNA
  • Image Processing, Computer-Assisted* / methods
  • Intercellular Signaling Peptides and Proteins

Substances

  • Intercellular Signaling Peptides and Proteins
  • DNA
  • Adenosine Deaminase