Improving the quality of genome, protein sequence, and taxonomy databases: a prerequisite for microbiome meta-omics 2.0

Proteomics. 2015 Oct;15(20):3418-23. doi: 10.1002/pmic.201500104. Epub 2015 Sep 10.

Abstract

High-throughput shotgun metaproteomic approaches on environmental or medical microbiomes are producing huge amounts of tandem mass spectrometry data. These can be interpreted either with a general protein sequence database comprising tens of thousands of sequenced genomes or with a more customized database, such as one obtained by metagenome sequencing of the DNA extracted from the same sample. However, not all entries in a nucleotide or protein sequence database are of equal quality, and this can critically impact metaproteomic data interpretation. In this viewpoint article, we exemplify several key issues. First, the interpretation of genome or transcriptome data may be erroneous because of inaccurate contig assembly and gene prediction; metaproteogenomic strategies offer a promising way to mitigate this. Second, errors in sample handling and taxonomical characterization may also be problematic. Third, cross-contamination of genome sequences is frequent yet underestimated. As a consequence of these structural errors in protein sequences, and of additional problems arising from homology-based functional annotation of proteins, specific efforts are required for better interpretation of metaproteomic data. We propose the development of new bioinformatic pipelines devoted to the detection and correction of errors and contaminations, in order to improve the overall quality of sequence and taxonomy databases for metaproteomics.

Keywords: Bioinformatics; Genomes; Metaproteomics; Microbiomes; Proteogenomics; Taxonomy.

MeSH terms

  • Amino Acid Sequence / genetics*
  • Classification
  • Computational Biology
  • Databases, Genetic
  • Genomics*
  • Metagenome / genetics
  • Microbiota / genetics*
  • Molecular Sequence Annotation
  • Proteomics*
  • Tandem Mass Spectrometry
  • Transcriptome / genetics