Estimation of the proportion of true null hypotheses under sparse dependence: Adaptive FDR controlling in microarray data

Aniket Biswas; Subrata Chakraborty; Vishwa Jyoti Baruah

doi:10.1177/09622802221074164

Stat Methods Med Res. 2022 May;31(5):917-927. doi: 10.1177/09622802221074164. Epub 2022 Feb 8.

Authors

Aniket Biswas¹, Subrata Chakraborty¹, Vishwa Jyoti Baruah²

Affiliations

¹ Department of Statistics, Dibrugarh University, Dibrugarh, Assam, India.
² Center for Biotechnology and Bioinformatics, Dibrugarh University, Dibrugarh, Assam, India.

PMID: 35133933
DOI: 10.1177/09622802221074164

Abstract

The proportion of non-differentially expressed genes is an important quantity in microarray data analysis, and an appropriate estimate of it is used to construct adaptive multiple testing procedures. Most estimators of the proportion of true null hypotheses, whether based on thresholding, maximum likelihood or density estimation, assume independence among the gene expressions. A sparse dependence structure is usually natural when modelling associations in microarray gene expression data, so methods are needed that accommodate sparse dependence within the framework of existing estimators. We propose a clustering-based method that places genes that are not coexpressed in the same group, using the high-dimensional correlation structure estimated under a sparsity assumption as the dissimilarity matrix. This novel method is applied to three existing estimators for the proportion of true null hypotheses. An extensive simulation study shows that the proposed method improves an existing estimator by making it less conservative and the corresponding adaptive Benjamini-Hochberg algorithm more powerful. The proposed method is applied to a microarray gene expression dataset of colorectal cancer patients, and the results show a gain in the number of differentially expressed genes detected. The R code is available at https://github.com/aniketstat/Proportiontion-of-true-null-under-sparse-dependence-2021.
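
For readers unfamiliar with adaptive multiple testing, the following minimal Python sketch illustrates the general idea of plugging an estimate of the proportion of true nulls into the Benjamini-Hochberg step-up rule. It is not the authors' method or their R code: it uses a generic Storey-type thresholding estimator that assumes independent p-values, and it does not implement the clustering step described above. The function names, the threshold `lam=0.5`, the floor on the estimate and the example p-values are all assumptions made for illustration.

```python
import numpy as np


def storey_pi0(pvalues, lam=0.5):
    """Storey-type thresholding estimate of the proportion of true nulls.

    Assumes independent p-values; `lam` is an illustrative tuning choice.
    """
    p = np.asarray(pvalues, dtype=float)
    # Null p-values are uniform, so roughly pi0 * (1 - lam) of all
    # p-values are expected to exceed lam.
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return float(min(1.0, pi0))


def adaptive_bh(pvalues, alpha=0.05, pi0=1.0):
    """Benjamini-Hochberg step-up at level alpha / pi0.

    Returns a boolean mask of rejected hypotheses. With pi0 = 1.0 this is
    the ordinary (non-adaptive) BH procedure.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    # Illustrative safeguard (an assumption): avoid dividing by zero
    # when the estimate of pi0 is 0.
    pi0 = max(pi0, 1.0 / m)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / (m * pi0)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


# Made-up p-values, for illustration only.
example_p = [0.001, 0.002, 0.003, 0.004, 0.03, 0.04, 0.7, 0.9]
pi0_hat = storey_pi0(example_p)
plain = adaptive_bh(example_p, alpha=0.05, pi0=1.0)
adaptive = adaptive_bh(example_p, alpha=0.05, pi0=pi0_hat)
```

Comparing `plain.sum()` with `adaptive.sum()` shows how a less conservative estimate of the null proportion can raise the number of rejections, which is the kind of power gain the abstract refers to.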

Keywords: ANOVA; False discovery rate; Tocher’s method; differentially expressed genes; high dimensional data; sub-Gaussian family. MSC 2010: 62F99; 62P10.

MeSH terms

Algorithms*
Computer Simulation
Gene Expression Profiling* / methods
Humans
Oligonucleotide Array Sequence Analysis / methods