Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning

Nat Commun. 2021 Mar 8;12(1):1504. doi: 10.1038/s41467-021-21790-4.

Abstract

Elucidating functionality in non-coding regions is a key challenge in human genomics. It has been shown that intolerance to variation of coding and proximal non-coding sequence is a strong predictor of human disease relevance. Here, we integrate intolerance to variation, functional genomic annotations and primary genomic sequence to build JARVIS: a comprehensive deep learning model to prioritize non-coding regions, outperforming other human lineage-specific scores. Despite being agnostic to evolutionary conservation, JARVIS performs comparably to or outperforms conservation-based scores in classifying pathogenic single-nucleotide and structural variants. In constructing JARVIS, we introduce the genome-wide residual variation intolerance score (gwRVIS), applying a sliding-window approach to whole genome sequencing data from 62,784 individuals. gwRVIS distinguishes Mendelian disease genes from more tolerant CCDS regions and highlights ultra-conserved non-coding elements as the most intolerant regions in the human genome. Both JARVIS and gwRVIS capture previously inaccessible human-lineage constraint information and will enhance our understanding of the non-coding genome.
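
To make the sliding-window idea behind a residual intolerance score concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each window is summarized by its count of all variants and its count of common variants, that common-variant counts are regressed on total counts, and that the standardized residual serves as the window score. The function names, the window size, the step, and the `common_maf` threshold are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev


def window_counts(positions, allele_freqs, region_length,
                  window=3000, step=3000, common_maf=0.001):
    """Count all variants and common variants in each window.

    window, step and common_maf are illustrative assumptions,
    not values taken from the abstract.
    """
    counts = []
    for start in range(0, region_length - window + 1, step):
        end = start + window
        total = common = 0
        for pos, af in zip(positions, allele_freqs):
            if start <= pos < end:
                total += 1
                if af >= common_maf:
                    common += 1
        counts.append((start, total, common))
    return counts


def residual_scores(counts):
    """Regress common-variant counts on total counts (ordinary least
    squares) and return one standardized residual per window.

    Assumption: a plain standardized residual stands in for whatever
    residual definition the paper actually uses.
    """
    xs = [total for _, total, _ in counts]
    ys = [common for _, _, common in counts]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = pstdev(residuals)
    return [r / sd if sd else 0.0 for r in residuals]


# Toy data: three non-overlapping windows of length 10.
positions = [1, 2, 3, 4, 11, 12, 13, 14, 21, 22]
allele_freqs = [0.1, 0.1, 0.0001, 0.0001,
                0.0001, 0.0001, 0.0001, 0.0001,
                0.1, 0.0001]
counts = window_counts(positions, allele_freqs, region_length=30,
                       window=10, step=10)
scores = residual_scores(counts)
```

In this sketch, a negative score marks a window with fewer common variants than the regression predicts from its total variant count, which is read here as greater intolerance. Setting `step` smaller than `window` gives overlapping windows.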

MeSH terms

  • DNA, Intergenic
  • Deep Learning*
  • Genetic Variation
  • Genome, Human*
  • Genomics*
  • Humans
  • Sequence Analysis, DNA
  • Whole Genome Sequencing

Substances

  • DNA, Intergenic