Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset

BMC Bioinformatics. 2019 May 28;20(1):268. doi: 10.1186/s12859-019-2855-9.

Abstract

Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher.

Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach compares the co-expression of gene-gene pairs in the adjusted data to a-priori knowledge of highly confident gene-gene associations, which are derived from an external reference based on thousands of unrelated experiments. Our framework includes three steps: (1) adjusting the data with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.
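
Steps (2) and (3) above can be illustrated with a minimal Python sketch. It is not the B-CeF R package and does not reproduce its API. The function names, the use of absolute Pearson correlation as the co-expression measurement, and the representation of the gold standard as a set of gene-index pairs are all assumptions made for illustration. The area under the ROC curve is used as the performance score, in line with the MeSH terms listed for this article.

```python
import numpy as np


def coexpression_scores(expr):
    """Score every gene pair (i < j) by absolute Pearson correlation.

    expr: array of shape (genes, samples), already batch-adjusted.
    Assumption: Pearson correlation stands in for the co-expression measure.
    """
    corr = np.corrcoef(np.asarray(expr, dtype=float))
    i, j = np.triu_indices(corr.shape[0], k=1)
    return i, j, np.abs(corr[i, j])


def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    # Average 1-based ranks, so tied scores share the same rank.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def evaluate_against_gold(expr, gold_pairs):
    """Return the AUC of co-expression scores against a gold standard.

    gold_pairs: hypothetical set of (i, j) gene-index tuples with i < j
    that the external reference marks as associated.
    """
    i, j, scores = coexpression_scores(expr)
    labels = [(a, b) in gold_pairs for a, b in zip(i.tolist(), j.tolist())]
    return roc_auc(scores, labels)


if __name__ == "__main__":
    # Toy data: genes 0 and 1 are perfectly co-expressed, gene 2 is not.
    toy = np.array([[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]])
    print(evaluate_against_gold(toy, {(0, 1)}))
```

Under these assumptions, running the same evaluation on the output of each correction method gives one score per method, so a method that removes true biological co-expression (over-correction) would be expected to score lower than one that preserves it.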

Conclusions: Our framework enables batch correction methods to be evaluated by how well they preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
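
As a rough illustration of correcting for known confounders with a multiple linear regression model, the Python sketch below fits each gene on a matrix of known covariates by ordinary least squares and subtracts the fitted confounder contribution. The function name `regress_out`, the input layout, and the choice to keep each gene's mean are assumptions for this example. The sketch is not the authors' implementation and does not specify which covariates were used for the GTEx data.

```python
import numpy as np


def regress_out(expr, covariates):
    """Remove the linear effect of known confounders from each gene.

    expr: array of shape (genes, samples).
    covariates: array of shape (samples,) or (samples, k), numeric known
        confounders (hypothetical examples: batch indicator, age).
    Returns an array shaped like expr with each gene's mean preserved.
    """
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]
    # Centre the covariates so the intercept equals the gene mean.
    centered = covariates - covariates.mean(axis=0)
    design = np.column_stack([np.ones(expr.shape[1]), centered])
    # One least-squares fit per gene, solved jointly: design @ coef ~ expr.T
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    confounder_part = centered @ coef[1:]
    return (expr.T - confounder_part).T


if __name__ == "__main__":
    batch = np.array([0, 0, 0, 1, 1, 1])
    signal = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
    expr = (signal + 5.0 * batch)[None, :]  # one gene with a batch shift of 5
    print(regress_out(expr, batch))
```

In the toy example the batch shift is removed while the within-batch pattern is retained. With real data, a covariate that is correlated with the biology of interest would also remove part of that signal, which is the over-correction risk the framework is designed to detect.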

Keywords: Batch correction; Batch effect; ComBat; GTEx; Gene expression; Principal component analysis.

MeSH terms

  • Algorithms*
  • Area Under Curve
  • Computational Biology / methods*
  • Databases, Genetic*
  • Epistasis, Genetic*
  • Gene Expression Regulation
  • Genes*
  • Humans
  • ROC Curve
  • Subcutaneous Fat / metabolism