Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset

Judith Somekh; Shai S Shen-Orr; Isaac S Kohane

doi:10.1186/s12859-019-2855-9

BMC Bioinformatics. 2019 May 28;20(1):268. doi: 10.1186/s12859-019-2855-9.

Authors

Judith Somekh¹ ² ³, Shai S Shen-Orr⁴, Isaac S Kohane⁵

Affiliations

¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. judith_somekh@is.haifa.ac.il.
² Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel. judith_somekh@is.haifa.ac.il.
³ Department of Information Systems, University of Haifa, Haifa, Israel. judith_somekh@is.haifa.ac.il.
⁴ Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel.
⁵ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Abstract

Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects can also remove some biologically meaningful signal. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher.

Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach compares the co-expression of gene-gene pairs in the adjusted data with a-priori knowledge of high-confidence gene-gene associations, derived from an external reference that is based on thousands of unrelated experiments. Our framework includes three steps: (1) adjusting the data with the desired methods, (2) calculating gene-gene co-expression measurements for the adjusted datasets, and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.
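The three steps can be illustrated with a minimal Python sketch. This is not the B-CeF R package or its API: every function name and all of the toy data below are hypothetical. It makes several simplifying assumptions: a simple regression on a single known batch indicator stands in for the correction method, absolute Pearson correlation stands in for the co-expression measurement, a hand-picked set of gene pairs stands in for the externally derived gold standard, and the area under the ROC curve (AUC) serves as the performance score.

```python
from itertools import combinations
from math import sqrt


def residualize(values, confounder):
    """Step 1 (toy): subtract the least-squares linear fit on one known confounder."""
    n = len(values)
    mean_x = sum(confounder) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in confounder)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(confounder, values))
    slope = sxy / sxx
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(confounder, values)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def coexpression(expr):
    """Step 2: absolute Pearson correlation for every gene pair."""
    return {
        (g1, g2): abs(pearson(expr[g1], expr[g2]))
        for g1, g2 in combinations(sorted(expr), 2)
    }


def auc(scores, gold_pairs):
    """Step 3: probability that a gold pair outscores a non-gold pair (ties count 0.5)."""
    pos = [s for pair, s in scores.items() if pair in gold_pairs]
    neg = [s for pair, s in scores.items() if pair not in gold_pairs]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): six samples in two batches, two independent biological signals.
batch = [0, 0, 0, 1, 1, 1]
signal_1 = [1, 0, -1, 1, 0, -1]
signal_2 = [1, -2, 1, -1, 2, -1]

# Genes A and C carry an added batch effect; B and D do not.
raw = {
    "A": [s + 5 * b for s, b in zip(signal_1, batch)],
    "B": list(signal_1),
    "C": [s + 5 * b for s, b in zip(signal_2, batch)],
    "D": list(signal_2),
}

# Hypothetical gold standard: pairs that share a biological signal.
gold = {("A", "B"), ("C", "D")}

adjusted = {gene: residualize(values, batch) for gene, values in raw.items()}

auc_raw = auc(coexpression(raw), gold)
auc_adj = auc(coexpression(adjusted), gold)
print(f"AUC before correction: {auc_raw:.2f}")
print(f"AUC after correction:  {auc_adj:.2f}")
```

In this toy data the shared batch effect makes the unrelated genes A and C look strongly co-expressed, which lowers the AUC of the uncorrected data. Regressing out the batch indicator removes that spurious pair and raises the AUC. An over-correcting method would instead be expected to weaken the gold-standard pairs and lower the score.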

Conclusions: Our framework enables batch correction methods to be evaluated by how well they preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.

Keywords: Batch correction; Batch effect; ComBat; GTEx; Gene expression; Principal component analysis.

MeSH terms

Algorithms*
Area Under Curve
Computational Biology / methods*
Databases, Genetic*
Epistasis, Genetic*
Gene Expression Regulation
Genes*
Humans
ROC Curve
Subcutaneous Fat / metabolism