Investigating differential abundance methods in microbiome data: A benchmark study

PLoS Comput Biol. 2022 Sep 8;18(9):e1010467. doi: 10.1371/journal.pcbi.1010467. eCollection 2022 Sep.

Abstract

The development of increasingly efficient and cost-effective high-throughput DNA sequencing techniques has made it easier to study complex microbial systems. Recently, researchers have shown great interest in studying the microorganisms that characterise different ecological niches. Differential abundance analysis aims to find differences in the abundance of each taxon between two classes of subjects or samples, assigning a significance value to each comparison. Several bioinformatic methods have been developed specifically for this task, taking into account the challenges of microbiome data, such as sparsity, differences in sequencing depth between samples, and compositionality. Differential abundance analysis has led to important conclusions in different fields, from health to the environment. However, the lack of a known biological truth makes it difficult to validate the results obtained. In this work we exploit metaSPARSim, a microbial sequencing count data simulator, to simulate data with differentially abundant features between experimental groups. We perform a complete comparison of recently developed and established methods on a common benchmark, paying particular attention to the reliability of both the simulated scenarios and the evaluation metrics. The performance overview covers numerous scenarios, studying the effect on methods' results of the main covariates, such as sample size, percentage of differentially abundant features, sequencing depth, feature variability, normalisation approach and ecological niche. Mainly, we find that methods show good control of the type I error and, generally, also of the false discovery rate at high sample size, while recall seems to depend on the dataset and sample size.
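As an illustration of the evaluation metrics named above, the following minimal sketch shows how type I error, false discovery rate and recall could be computed when the truly differentially abundant features are known from simulation. It is not the benchmark's own code: the function names, the Benjamini-Hochberg correction and the 0.05 threshold are assumptions made for the example, and the abstract does not state which definitions or thresholds the study used.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def evaluate(p_values, is_truly_da, alpha=0.05):
    """Score one method's p-values against the simulated ground truth.

    is_truly_da[i] is True when feature i was simulated as differentially
    abundant. alpha is an assumed threshold, not taken from the study.
    """
    adjusted = benjamini_hochberg(p_values)
    called = [q <= alpha for q in adjusted]
    pairs = list(zip(called, is_truly_da))
    tp = sum(c and t for c, t in pairs)
    fp = sum(c and not t for c, t in pairs)
    fn = sum((not c) and t for c, t in pairs)
    n_null = sum(not t for t in is_truly_da)
    # Type I error uses raw p-values on the truly null features only.
    raw_fp = sum(p <= alpha and not t for p, t in zip(p_values, is_truly_da))
    return {
        "fdr": fp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "type_I_error": raw_fp / n_null if n_null else 0.0,
    }


# Toy example: six features, three of them simulated as differentially abundant.
p_values = [0.001, 0.01, 0.03, 0.2, 0.5, 0.04]
truth = [True, True, False, False, False, True]
result = evaluate(p_values, truth)
```

Because the ground truth comes from the simulator, each metric reduces to counting calls against known labels, which is what makes a simulation-based benchmark possible when no biological truth is available.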

Publication types

  • Review

MeSH terms

  • Benchmarking*
  • Computational Biology / methods
  • High-Throughput Nucleotide Sequencing / methods
  • Microbiota*
  • Reproducibility of Results

Grants and funding

This work has been supported by the SEED Project "tRajectoriEs of baCtErial NeTwoRks from hEalthy to disease state and back (RECENTRE)" funded by the Department of Information Engineering of the University of Padova, Grants nr. DI_C_BIRD2020_01 (BDC). G.B. was funded by PON 'Ricerca e Innovazione' 2014-2020. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.