Outgroup Machine Learning Approach Identifies Single Nucleotide Variants in Noncoding DNA Associated with Autism Spectrum Disorder

Pac Symp Biocomput. 2019:24:260-271.

Abstract

Autism spectrum disorder (ASD) is a heritable neurodevelopmental disorder affecting 1 in 59 children. While noncoding genetic variation has been shown to play a major role in many complex disorders, the contribution of these regions to ASD susceptibility remains unclear. Genetic analyses of ASD typically use unaffected family members as controls; however, we hypothesize that this method does not effectively elevate variant signal in the noncoding region due to family members having subclinical phenotypes arising from common genetic mechanisms. In this study, we use a separate, unrelated outgroup of individuals with progressive supranuclear palsy (PSP), a neurodegenerative condition with no known etiological overlap with ASD, as a control population. We use whole genome sequencing data from a large cohort of 2182 children with ASD and 379 controls with PSP, sequenced at the same facility with the same machines and variant calling pipeline, in order to investigate the role of noncoding variation in the ASD phenotype. We analyze seven major types of noncoding variants: microRNAs, human accelerated regions, hypersensitive sites, transcription factor binding sites, DNA repeat sequences, simple repeat sequences, and CpG islands. After identifying and removing batch effects between the two groups, we trained an ℓ1-regularized logistic regression classifier to predict ASD status from each set of variants. The classifier trained on simple repeat sequences performed well on a held-out test set (AUC-ROC = 0.960); this classifier was also able to differentiate ASD cases from controls when applied to a completely independent dataset (AUC-ROC = 0.960). This suggests that variation in simple repeat regions is predictive of the ASD phenotype and may contribute to ASD risk. Our results show the importance of the noncoding region and the utility of independent control groups in effectively linking genetic variation to disease phenotype for complex disorders.
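The classification step described in the abstract (an ℓ1-regularized logistic regression scored by AUC-ROC on held-out samples) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline. The data are synthetic binary "variant present" indicators, and all function names, the regularization strength `lam`, the step size, and the proximal-gradient solver are assumptions chosen for illustration, since the abstract does not state which solver or hyperparameters were used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of t * |w|: shrinks w toward zero and clips small values to exactly 0.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.02, lr=0.5, n_iter=200):
    # Proximal gradient descent (ISTA) on mean log-loss + lam * ||w||_1.
    # The intercept b is not penalized. Hyperparameters are illustrative assumptions.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                if xj:
                    grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_scores(X, w, b):
    # Predicted probability of the positive (case) label for each sample.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def auc_roc(y_true, scores):
    # AUC-ROC as the probability that a random case outscores a random control
    # (ties count as one half).
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n=600, d=20, seed=0):
    # Synthetic stand-in for a variant matrix: d binary features per sample,
    # only the first three of which influence the label (assumed effect sizes).
    rng = random.Random(seed)
    true_w = [2.0, -2.0, 1.5] + [0.0] * (d - 3)
    X, y = [], []
    for _ in range(n):
        xi = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(d)]
        p = sigmoid(-0.3 + sum(wj * xj for wj, xj in zip(true_w, xi)))
        X.append(xi)
        y.append(1 if rng.random() < p else 0)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]
    w, b = fit_l1_logistic(X_train, y_train)
    kept = [j for j, wj in enumerate(w) if wj != 0.0]
    print("features with nonzero weight:", kept)
    print("held-out AUC-ROC:", round(auc_roc(y_test, predict_scores(X_test, w, b)), 3))
```

The ℓ1 penalty is what makes such a model usable for variant prioritization: the soft-thresholding step sets weights exactly to zero, so the surviving features form a short candidate list. In practice an established implementation (for example, a library logistic regression with an ℓ1 penalty) would likely be preferred over this hand-rolled solver.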

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Autism Spectrum Disorder / genetics*
  • Case-Control Studies
  • Child
  • Cohort Studies
  • Computational Biology
  • CpG Islands
  • DNA / genetics*
  • Female
  • Gene Regulatory Networks
  • Genetic Association Studies
  • Genetic Predisposition to Disease
  • Genetic Variation*
  • Humans
  • Logistic Models
  • Machine Learning*
  • Male
  • MicroRNAs / genetics
  • Microsatellite Repeats
  • Phenotype
  • Polymorphism, Single Nucleotide
  • RNA, Untranslated / genetics
  • Supranuclear Palsy, Progressive / genetics
  • Whole Genome Sequencing

Substances

  • MicroRNAs
  • RNA, Untranslated
  • DNA