Analysis of metagenomic data containing high biodiversity levels

PLoS One. 2013;8(3):e58118. doi: 10.1371/journal.pone.0058118. Epub 2013 Mar 7.

Abstract

In this paper we address the problem of analysing Next Generation Sequencing samples that are expected to contain high biodiversity. We analysed several well-known 16S rRNA datasets from experimental samples, including both long and short sequences, numbering in the tens of thousands, in addition to carefully crafted synthetic datasets containing more than 7000 OTUs. From this analysis we identified several patterns and used them to develop new guidelines for experimentation in conditions of high biodiversity. We analysed the suitability of different clustering packages for this type of situation, the problem of even sampling, and the relative effectiveness of the Chao1 and ACE estimators, as well as their effect on sampling size for a variety of population distributions. As regards practical analysis procedures, we advocate an approach that retains as much high-quality experimental data as possible. By carefully applying selection rules that combine taxonomic assignment with clustering strategies, we derived a set of recommendations for ultra-sequencing data analysis at high biodiversity levels.
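
For readers unfamiliar with the Chao1 estimator named above, the following is a minimal sketch of its standard bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where S_obs is the number of observed OTUs, F1 the number of singletons and F2 the number of doubletons. It is not the authors' implementation; the function name `chao1` and the toy counts are hypothetical and chosen only for illustration.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts.

    otu_counts: iterable of non-negative integers, one per OTU
    (hypothetical input format, assumed for this sketch).
    """
    counts = [c for c in otu_counts if c > 0]  # ignore unobserved OTUs
    s_obs = len(counts)                        # observed richness
    f1 = sum(1 for c in counts if c == 1)      # singletons
    f2 = sum(1 for c in counts if c == 2)      # doubletons
    # The bias-corrected form stays defined even when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy sample (made-up counts): 8 observed OTUs, 4 singletons, 2 doubletons.
toy_counts = [10, 5, 2, 2, 1, 1, 1, 1]
print(chao1(toy_counts))
```

The estimate exceeds the observed richness whenever there are at least two singletons, which is why rare OTUs, and hence sequencing depth, weigh so heavily on richness estimates in high-biodiversity samples.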

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Biodiversity*
  • Computational Biology
  • Databases, Nucleic Acid
  • High-Throughput Nucleotide Sequencing
  • Metagenomics*
  • RNA, Ribosomal, 16S / genetics

Substances

  • RNA, Ribosomal, 16S

Grants and funding

The authors wish to acknowledge the support provided by grant EGO22008 from the Spanish Ministry of Agriculture, Food and Environment for this research, as well as partial support from the Spanish Research Council (CSIC) PIE number 200420E397, and grants 510RT0391 (FreeBIT) from CYTED and BM1006 (SEQAHEAD) from the EU 7FP COST program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.