MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics

J Proteome Res. 2016 Mar 4;15(3):713-20. doi: 10.1021/acs.jproteome.5b00749. Epub 2016 Jan 12.

Abstract

Shotgun proteomics experiments generate large numbers of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative to traditional search engines for analyzing spectra, and is especially useful for larger-scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to those produced by the previous state-of-the-art method. We expect our method to advance the construction of spectral libraries and to serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available at https://github.com/statisticalbiotechnology/maracluster (under an Apache 2.0 license).
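
The intuition behind a rarity-based distance combined with complete-linkage clustering can be illustrated with a small sketch. This is not the MaRaCluster implementation and does not reproduce the metric from the paper: the function names, the 1.0 m/z bin width, the smoothed negative-log weighting, the weighted-Jaccard distance, and the 0.7 threshold are all assumptions chosen for illustration only.

```python
import math
from itertools import combinations


def bin_peaks(mz_values, bin_width=1.0):
    """Map fragment m/z values to integer bins (bin width is an assumed value)."""
    return {int(mz / bin_width) for mz in mz_values}


def rarity_weights(spectra):
    """Weight each peak bin by how rarely it occurs across all spectra.

    Assumption: weight = -log(count / (n + 1)). The +1 keeps the weight
    positive even for a bin that is present in every spectrum.
    """
    n = len(spectra)
    counts = {}
    for peaks in spectra:
        for b in peaks:
            counts[b] = counts.get(b, 0) + 1
    return {b: -math.log(c / (n + 1)) for b, c in counts.items()}


def rarity_distance(a, b, weights):
    """Weighted Jaccard distance: shared rare peaks lower the distance most."""
    shared = sum(weights[p] for p in a & b)
    total = sum(weights[p] for p in a | b)
    return 1.0 - shared / total if total > 0 else 1.0


def complete_linkage(spectra, weights, threshold):
    """Naive agglomerative clustering with complete linkage.

    Two clusters are merged only if their most distant pair of members
    is still within the threshold.
    """
    clusters = [[i] for i in range(len(spectra))]

    def cluster_dist(c1, c2):
        return max(
            rarity_distance(spectra[i], spectra[j], weights)
            for i in c1
            for j in c2
        )

    while len(clusters) > 1:
        d, x, y = min(
            (cluster_dist(clusters[x], clusters[y]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Toy spectra (made-up m/z values): all four share a common peak near 100,
# while each pair (0, 1) and (2, 3) shares two rarer peaks.
raw_spectra = [
    [100.1, 200.2, 300.3, 400.4],
    [100.2, 200.3, 300.1, 500.5],
    [100.3, 600.6, 700.7, 800.8],
    [100.4, 600.2, 700.1, 900.9],
]
spectra = [bin_peaks(s) for s in raw_spectra]
weights = rarity_weights(spectra)
clusters = complete_linkage(spectra, weights, threshold=0.7)
print(clusters)
```

In this toy input, the peak near m/z 100 appears in every spectrum and therefore receives the smallest weight, so it contributes little evidence that two spectra belong together. The two pairs that share rarer peaks are grouped, while the common peak alone is not enough to merge the pairs with each other. The quadratic pairwise search is only suitable for a handful of spectra; a large-scale study would need a far more efficient scheme.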

Keywords: Mass spectrometry; bioinformatics; database search; hierarchical clustering; proteomics; spectral archives; spectral libraries.

MeSH terms

  • Animals
  • Cluster Analysis*
  • Data Mining
  • Humans
  • Mass Spectrometry / methods*
  • Peptides / analysis*
  • Proteomics / methods*
  • Search Engine
  • Software

Substances

  • Peptides