High-throughput parallel proteogenomics: a bacterial case study

Proteomics. 2014 Dec;14(23-24):2780-9. doi: 10.1002/pmic.201400185.

Abstract

In recent years, a new paradigm for genome annotation has emerged, termed "proteogenomics," which uses peptide MS data to annotate a genome. Peptides are mapped to a six-frame translation of the genome, and to splice databases where available, and the resulting matches may suggest refinements to gene models. With this approach it is possible to refine exon and gene boundaries and to identify novel genes, frame shifts, reverse-strand coding regions, translated UTRs, and novel splice junctions. Two challenges in proteogenomics are (1) how to assign confidence to a resulting annotation and (2) how to apply these gene model refinements, either through manual annotation or automatically by training gene prediction tools. This is not straightforward: many gene prediction tools are suited to a particular class of genome (either eukaryotic or prokaryotic), have been trained on and refined with model organisms such as Arabidopsis thaliana and Escherichia coli, and differ in how far they can make use of external evidence. In this study, we outline a suitable approach to preprocessing mass spectra and optimizing the MS/MS search for a given dataset. We also discuss challenges that continue to pose problems in the field of proteogenomics, and strategies for tackling them with existing tools. As a case study we use Bradyrhizobium diazoefficiens, a nitrogen-fixing bacterium with a 9.1 Mb genome, applying the latest second-generation proteogenomics tools with multiple gene models to cross-validate proteogenomic annotations.
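
The core mapping step named in the abstract, matching peptides against a six-frame translation of a genome, can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names (translate, six_frame_translation, map_peptide) are hypothetical, exact substring matching stands in for a real MS/MS search engine, and the standard codon table is assumed (its amino acid assignments match those of the bacterial table).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
# Assumption: stop codons are written as "*"; unknown codons become "X".
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): _AMINO[i]
    for i, codon in enumerate(product(_BASES, repeat=3))
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate DNA from its first base, ignoring a trailing partial codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate all six reading frames; keys are '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def map_peptide(dna: str, peptide: str) -> list:
    """Return (frame, amino-acid offset) for every exact peptide match."""
    hits = []
    for frame, protein in six_frame_translation(dna).items():
        start = protein.find(peptide)
        while start != -1:
            hits.append((frame, start))
            start = protein.find(peptide, start + 1)
    return hits
```

A hit in a frame or region that lacks an annotated gene is the kind of evidence that may suggest a gene model refinement. In practice, each peptide identification would also carry a confidence score from the search engine, which this sketch leaves out.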

Keywords: Bioinformatics; Bradyrhizobium diazoefficiens; Proteogenomics.

MeSH terms

  • Arabidopsis / metabolism
  • Bradyrhizobium / metabolism*
  • Computational Biology / methods*
  • Escherichia coli / metabolism
  • Proteomics / methods*