Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification?

Brief Bioinform. 2018 Sep 28;19(5):954-970. doi: 10.1093/bib/bbx033.

Abstract

While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments nowadays offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has been rarely evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation on the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired from different instrument types from collision-induced dissociation and higher energy collisional dissociation (HCD) fragmentation mode using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in terms of running time speedup by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. Finally, we discuss the potential of de novo sequencing for now becoming more widely used in the field.

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Animals
  • Computational Biology / methods
  • Computer Simulation
  • Databases, Protein / statistics & numerical data
  • Humans
  • Mice
  • Peptides / chemistry
  • Proteomics / methods*
  • Proteomics / statistics & numerical data
  • Pyrococcus furiosus / genetics
  • Saccharomyces cerevisiae / genetics
  • Sequence Analysis, Protein / methods*
  • Sequence Analysis, Protein / statistics & numerical data
  • Sequence Tagged Sites
  • Software
  • Tandem Mass Spectrometry / methods
  • Tandem Mass Spectrometry / statistics & numerical data

Substances

  • Peptides