Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra

Vera Rieder; Karin U Schork; Laura Kerschke; Bernhard Blank-Landeshammer; Albert Sickmann; Jörg Rahnenführer

doi:10.1021/acs.jproteome.7b00427

J Proteome Res. 2017 Nov 3;16(11):4035-4044. doi: 10.1021/acs.jproteome.7b00427.

Authors

Vera Rieder¹, Karin U Schork¹,², Laura Kerschke¹,³, Bernhard Blank-Landeshammer⁴, Albert Sickmann²,⁴,⁵, Jörg Rahnenführer¹

Affiliations

¹ Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany.
² Medizinische Fakultät, Medizinisches Proteom-Center, Ruhr-University Bochum, 44801 Bochum, Germany.
³ Institut für Biometrie und Klinische Forschung (IBKF) der Westfälischen Wilhelms-Universität und des Universitätsklinikums Münster, 48149 Münster, Germany.
⁴ Leibniz-Institut für Analytische Wissenschaften-ISAS e.V., 44139 Dortmund, Germany.
⁵ Department of Chemistry, College of Physical Sciences, University of Aberdeen, Aberdeen AB24 3FX, Scotland, United Kingdom.

PMID: 28959885
DOI: 10.1021/acs.jproteome.7b00427

Abstract

In proteomics, liquid chromatography-tandem mass spectrometry (LC-MS/MS) is established for identifying peptides and proteins. Duplicated spectra, that is, multiple spectra of the same peptide, occur both in single MS/MS runs and in large spectral libraries. Clustering tandem mass spectra is used to find consensus spectra, with manifold applications. First, it speeds up database searches, as performed for instance by Mascot. Second, it helps to identify novel peptides across species. Third, it is used for quality control to detect wrongly annotated spectra. We compare different clustering algorithms based on the cosine distance between spectra. CAST, MS-Cluster, and PRIDE Cluster are popular algorithms to cluster tandem mass spectra. We add well-known algorithms for large data sets, hierarchical clustering, DBSCAN, and connected components of a graph, as well as the new method N-Cluster. All algorithms are evaluated on real data with varied parameter settings. Cluster results are compared with each other and with peptide annotations based on validation measures such as purity. Quality control, regarding the detection of wrongly (un)annotated spectra, is discussed for exemplary resulting clusters. N-Cluster proves to be highly competitive. All clustering results benefit from the so-called DISMS2 filter that integrates additional information, for example, on precursor mass.
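The ideas named in the abstract (cosine distance between spectra, clustering by connected components of a graph, and purity against peptide annotations) can be illustrated with a minimal sketch. This is not the authors' implementation. The fixed-width m/z binning, the bin width, the distance threshold, the standard majority-count definition of purity, and all function names are assumptions made for illustration; the abstract does not specify how the paper preprocesses spectra or handles unannotated ones.

```python
import math
from collections import Counter


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins.

    The binning scheme and bin_width are illustrative assumptions.
    """
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two binned spectra (sparse dicts)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty spectrum as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def connected_components(spectra, threshold):
    """Label spectra by connected component of the graph that links
    every pair whose cosine distance is at most `threshold`."""
    parent = list(range(len(spectra)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(spectra)):
        for j in range(i + 1, len(spectra)):
            if cosine_distance(spectra[i], spectra[j]) <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(spectra))]


def purity(cluster_labels, annotations):
    """Fraction of spectra carrying the majority annotation of their cluster."""
    members = {}
    for label, annotation in zip(cluster_labels, annotations):
        members.setdefault(label, []).append(annotation)
    majority = sum(Counter(m).most_common(1)[0][1] for m in members.values())
    return majority / len(annotations)


# Toy data (hypothetical peaks and peptide labels, not from the paper).
spectra = [
    bin_spectrum([(100.2, 10.0), (200.1, 5.0)]),
    bin_spectrum([(100.4, 9.0), (200.3, 6.0)]),
    bin_spectrum([(300.5, 7.0), (400.9, 3.0)]),
]
labels = connected_components(spectra, threshold=0.1)
print(labels, purity(labels, ["PEPA", "PEPA", "PEPB"]))
```

The pairwise loop is quadratic in the number of spectra, which is acceptable for a toy example but would need a candidate filter (for instance on precursor mass, in the spirit of the filter the abstract mentions) before it could be applied to large data sets.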

Keywords: clustering; tandem mass spectra.

Publication types

Comparative Study
Evaluation Study

MeSH terms

Algorithms*
Cluster Analysis*
Quality Control
Tandem Mass Spectrometry / methods*