Disease category-specific annotation of variants using an ensemble learning framework

Brief Bioinform. 2022 Jan 17;23(1):bbab438. doi: 10.1093/bib/bbab438.

Abstract

Understanding the impact of non-coding sequence variants on complex diseases is an essential problem. We present CASAVA, a novel ensemble learning framework that scores genomic loci by disease category-specific risk. Using disease-associated variants identified by GWAS as training data, and diverse sequencing-based genomic and epigenomic profiles as features, CASAVA provides risk predictions for 24 major categories of diseases throughout the human genome. Our studies showed that the CASAVA scores at a genomic locus provide a reasonable prediction of the disease-specific and disease category-specific risk of non-coding variants located within the locus. Taking MHC2TA and immune system diseases as an example, we demonstrate the potential of CASAVA in revealing variant-disease associations. A website (http://zhanglabtools.org/CASAVA) has been built to facilitate easy access to CASAVA scores.

Keywords: complex disease; disease category; ensemble learning; functional annotation; non-coding variant.
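
The general idea summarized in the abstract (one ensemble per disease category, trained on loci labeled by GWAS associations and described by epigenomic features) can be illustrated with a minimal sketch. This is not the CASAVA implementation: the abstract does not specify the base learners, so bagged decision stumps are swapped in here purely for illustration, and the toy data, the feature layout, the second category name and all function names are hypothetical.

```python
import random

def make_toy_loci(n, informative_feature, seed):
    """Synthetic loci (hypothetical data): 3 feature signals and a 0/1 risk label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        row = [rng.random() for _ in range(3)]
        row[informative_feature] += 2.0 * label  # risk loci carry a stronger signal
        X.append(row)
        y.append(label)
    return X, y

def train_stump(X, y):
    """Pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                errors = sum(
                    (1 if polarity * (row[j] - t) >= 0 else 0) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]

def stump_predict(stump, row):
    j, t, polarity = stump
    return 1 if polarity * (row[j] - t) >= 0 else 0

def train_ensemble(X, y, n_models=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the labeled loci."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def risk_score(models, row):
    """Fraction of ensemble members that vote 'risk' for this locus."""
    return sum(stump_predict(m, row) for m in models) / len(models)

# One ensemble per disease category. The second category name is a placeholder;
# each category is (artificially) driven by a different feature column.
categories = {"immune system disease": 0, "category B (placeholder)": 1}
ensembles = {}
for k, (name, feature) in enumerate(categories.items()):
    X, y = make_toy_loci(60, feature, seed=k)
    ensembles[name] = train_ensemble(X, y, seed=k)

# Score a hypothetical query locus with a strong signal in feature 0 only.
query_locus = [3.5, 0.0, 0.5]
scores = {name: risk_score(models, query_locus) for name, models in ensembles.items()}
print(scores)
```

In this toy setting the query locus should score high for the category tied to feature 0 and low for the other, which mirrors the notion of a per-category score at a locus. Real features, labels and models would differ substantially.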

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Genome, Human
  • Genome-Wide Association Study*
  • Genomics
  • Humans
  • Machine Learning
  • Polymorphism, Single Nucleotide*