A Gene Set-Integrated Approach for Predicting Disease-Associated Genes

IEEE/ACM Trans Comput Biol Bioinform. 2023 Nov-Dec;20(6):3440-3450. doi: 10.1109/TCBB.2022.3214517. Epub 2023 Dec 25.

Abstract

Identifying disease-associated genes is important for studying the pathogenic mechanisms of complex diseases. Recent models for disease gene prediction are predominantly based on molecular expression data and networks, including gene expression, protein expression, co-expression networks, and protein-protein interaction networks. One limitation of these methods is that they do not consider the knowledge encoded in annotated gene sets, which represent known pathways or functionally related sets of genes. In this study, we propose a new approach that predicts disease-associated genes by integrating annotated gene set data from the Molecular Signatures Database (MSigDB). It first represents and integrates the different types of annotated gene sets in MSigDB in the form of a signal matrix. It then uses the signal matrix as the gene feature to train the disease gene prediction model. We compare our method with existing methods in predicting genes for five complex diseases. The results show that our method outperforms the other methods. Further, we perform a case study on autism spectrum disorder (ASD). We find that the predicted ASD genes are associated with ASD, based on statistical analysis of biological networks and independent ASD studies. The source code, prediction results and datasets are publicly available at https://github.com/genemine/GSI.git.
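
To make the idea of a gene set-derived feature concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define how the signal matrix is computed, so this example assumes the simplest possible encoding: a binary gene-by-gene-set membership matrix built from text in the tab-separated GMT layout that MSigDB distributes (set name, description, then member genes). The function names `parse_gmt` and `membership_matrix` and the toy gene and set names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Parse GMT text: each line is name<TAB>description<TAB>gene1<TAB>gene2..."""
    gene_sets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene_sets[fields[0]] = [g for g in fields[2:] if g]
    return gene_sets


def membership_matrix(
    gene_sets: Dict[str, List[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (genes, set_names, matrix) with matrix[i][j] = 1 if gene i is in set j.

    ASSUMPTION: binary membership stands in for the paper's signal matrix,
    whose exact definition is not given in the abstract.
    """
    set_names = sorted(gene_sets)
    genes = sorted({g for members in gene_sets.values() for g in members})
    row = {g: i for i, g in enumerate(genes)}
    col = {name: j for j, name in enumerate(set_names)}
    matrix = [[0] * len(set_names) for _ in genes]
    for name, members in gene_sets.items():
        for g in members:
            matrix[row[g]][col[name]] = 1
    return genes, set_names, matrix


# Toy input with hypothetical set and gene names.
toy_gmt = "SET_A\tna\tGENE1\tGENE2\nSET_B\tna\tGENE2\tGENE3\n"
genes, set_names, matrix = membership_matrix(parse_gmt(toy_gmt))
```

Each row of `matrix` can then serve as a feature vector for one gene in any standard supervised classifier; which classifier the authors use is not stated in the abstract.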

MeSH terms

  • Autism Spectrum Disorder* / genetics
  • Databases, Genetic
  • Humans
  • Protein Interaction Maps / genetics
  • Software