Transcriptomics and machine learning to advance schizophrenia genetics: A case-control study using post-mortem brain data

Comput Methods Programs Biomed. 2022 Feb:214:106590. doi: 10.1016/j.cmpb.2021.106590. Epub 2021 Dec 16.

Abstract

Background and objective: Alterations of the expression of a variety of genes have been reported in patients with schizophrenia (SCZ). Moreover, machine learning (ML) analysis of gene expression microarray data has shown promising preliminary results in the study of SCZ. Our objective was to evaluate the performance of ML in classifying SCZ cases and controls based on gene expression microarray data from the dorsolateral prefrontal cortex.

Methods: We applied a state-of-the-art ML algorithm (XGBoost) to train and evaluate a classification model using 201 SCZ cases and 278 controls. We used 10-fold cross-validation for model selection and a held-out testing set to evaluate the model. The metric used to evaluate classification performance was the area under the receiver operating characteristic curve (AUC).
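As an illustration of the evaluation design described in the Methods, the following is a minimal sketch of stratified fold assignment and of AUC computed as a rank statistic. It is not the authors' pipeline: the function names `stratified_folds` and `auc_score`, the toy labels, and the toy scores are all assumptions made for this example, and the classifier itself (XGBoost in the study) is omitted.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping the class ratio similar in each fold.

    Hypothetical helper for illustration; not taken from the study.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def auc_score(labels, scores):
    """AUC as the probability that a random case scores above a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only: 20 cases (1) and 30 controls (0), not the study's samples.
    toy_labels = [1] * 20 + [0] * 30
    folds = stratified_folds(toy_labels, k=5)
    print([len(f) for f in folds])

    # Toy predicted probabilities for four samples.
    print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In this framing, an AUC of 0.5 corresponds to chance-level ranking of cases against controls and 1.0 to perfect separation, which is the scale on which the reported values should be read.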

Results: We report an average AUC of 0.76 on 10-fold cross-validation and an AUC of 0.76 on testing data not used during training. Analysis of the rolling balanced classification accuracy from high to low prediction confidence levels showed that accuracy for the most certain subset of predictions ranged between 80% and 90%. The ML model utilized 182 gene expression probes. Classification performance improved further when an automated ML strategy was applied to the 182 features, achieving an AUC of 0.79 on the same testing data. We found literature evidence linking all of the top ten ML-ranked genes to SCZ. Furthermore, we leveraged information from the full set of available microarray gene expression data via univariate differential gene expression analysis. We then prioritized differentially expressed gene sets using the piano gene set analysis package. We augmented the ranking of the prioritized gene sets with genes from the complex multivariate ML model, using hypergeometric tests to identify more robust gene sets. We identified two significant Gene Ontology molecular function gene sets: "oxidoreductase activity, acting on the CH-NH2 group of donors" and "integrin binding." Lastly, we present candidate treatments for SCZ based on findings from our study.

Conclusions: Overall, we observed above-chance performance from ML classification of SCZ cases and controls based on brain gene expression microarray data, and found that ML analysis of gene expression could further our understanding of the pathophysiology of SCZ and help identify novel treatments.
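The hypergeometric test mentioned in the Results can be sketched as an over-representation test: given a background of N genes, of which K belong to a gene set, it asks how likely it is that at least k of n selected genes fall in that set by chance. The sketch below is a generic version under that assumption; the abstract does not specify how the authors parameterized the test, and the function name `hypergeom_sf` and all numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: background size (all genes considered)
    K: number of background genes in the gene set
    n: number of selected genes (e.g. genes used by a model)
    k: number of selected genes that fall in the gene set
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    lo = max(k, 0)
    hi = min(K, n)
    total = comb(N, n)
    # Sum the probability mass over every overlap size from k up to the maximum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not values from the study:
    # 10,000 background genes, a 50-gene set, 150 selected genes, 5 overlapping.
    p_value = hypergeom_sf(k=5, N=10_000, K=50, n=150)
    print(f"over-representation p-value: {p_value:.3g}")
```

A small p-value here indicates that the selected genes overlap the gene set more than random selection would predict, which is one way a multivariate model's genes could lend support to a gene set ranked by univariate analysis.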

Keywords: Bioinformatics; Machine learning; Post-mortem; Schizophrenia; Transcriptomics.

MeSH terms

  • Brain
  • Case-Control Studies
  • Dorsolateral Prefrontal Cortex
  • Humans
  • Machine Learning
  • Schizophrenia* / genetics
  • Transcriptome