Improving the quality of genome, protein sequence, and taxonomy databases: a prerequisite for microbiome meta-omics 2.0

Olivier Pible; Jean Armengaud

doi:10.1002/pmic.201500104

Proteomics. 2015 Oct;15(20):3418-23. doi: 10.1002/pmic.201500104. Epub 2015 Sep 10.

Authors

Olivier Pible¹, Jean Armengaud¹

Affiliation

¹ CEA-Marcoule, DSV/IBITEC-S/SPI/Li2D, Laboratory "Innovative technologies for Detection and Diagnostics", Bagnols-sur-Cèze, France.

PMID: 26038180
DOI: 10.1002/pmic.201500104

Abstract

High-throughput shotgun metaproteomic approaches on environmental or medical microbiomes are producing huge amounts of tandem mass spectrometry data. These data can be interpreted either with a general protein sequence database comprising tens of thousands of sequenced genomes or with a more customized database, such as one obtained by metagenome sequencing of the DNA extracted from the same sample. However, not all entries in a nucleotide or protein sequence database are of equal quality, and this can critically impact metaproteomic data interpretation. In this viewpoint article, we exemplify several key issues. First, the interpretation of genome or transcriptome data may be erroneous because of inaccurate contig assembly and gene prediction; metaproteogenomic strategies could help mitigate this problem. Errors in sample handling and taxonomical characterization may also be problematic. Cross-contamination of genome sequences is frequent yet underestimated. Because of these structural errors in protein sequences, and of additional problems arising from homology-based functional annotation of proteins, specific efforts are required to better interpret metaproteomic data. We propose the development of new bioinformatic pipelines devoted to the detection and correction of errors and contaminations, in order to improve the overall quality of sequence and taxonomy databases for metaproteomics.

Keywords: Bioinformatics; Genomes; Metaproteomics; Microbiomes; Proteogenomics; Taxonomy.

MeSH terms

Amino Acid Sequence / genetics*
Classification
Computational Biology
Databases, Genetic
Genomics*
Metagenome / genetics
Microbiota / genetics*
Molecular Sequence Annotation
Proteomics*
Tandem Mass Spectrometry
Transcriptome / genetics