Availability of MudPIT data for classification of biological samples

Dario Di Silvestre; Italo Zoppis; Francesca Brambilla; Valeria Bellettato; Giancarlo Mauri; Pierluigi Mauri

doi:10.1186/2043-9113-3-1

J Clin Bioinforma. 2013 Jan 14;3(1):1. doi: 10.1186/2043-9113-3-1.

Authors

Dario Di Silvestre^#¹, Italo Zoppis^#², Francesca Brambilla¹, Valeria Bellettato¹, Giancarlo Mauri², Pierluigi Mauri¹

Affiliations

¹ Institute for Biomedical Technologies (ITB-CNR), via F.lli Cervi 93, Segrate (Milan), Italy.
² Department of Informatics, Systems and Communication, Viale Sarca 336, University of Milano-Bicocca, Milan, Italy.

^# Contributed equally.

Abstract

Background: Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used to develop methods that may help provide an unambiguous diagnosis of biological samples. In this context, we investigated the classification of phenotypes by applying a support vector machine (SVM) to experimental data obtained by the MudPIT approach. In particular, we compared the performance of SVM using two independent collections of complex samples and different data types, such as mass spectra (m/z), peptides and proteins.

Results: Overall, protein and peptide data provided more discriminant information than experimental mass spectra (overall accuracy higher than 87% in both collections 1 and 2). These results indicate that sequencing of peptides and proteins reduces the experimental noise affecting the raw mass spectra, and allows the extraction of more informative features for the effective classification of samples. In addition, the protein and peptide features selected by SVM showed an 80% match with the differentially expressed proteins identified by the MAProMa software.

Conclusions: These findings confirm the applicability of most label-free quantitative methods based on the processing of spectral count and SEQUEST-based SCORE values. They also stress the usefulness of MudPIT data for the correct grouping of sample phenotypes by applying both supervised and unsupervised learning algorithms. This capacity permits the evaluation of actual samples and is a good starting point for translating proteomic methodology into clinical application.
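
Illustrative sketch: The classification idea summarized above can be illustrated with a minimal linear SVM trained on a spectral-count matrix (samples as rows, proteins as columns). This is not the implementation used in the study. It is a sketch under stated assumptions: the counts and labels are invented toy values, the solver is a plain sub-gradient (Pegasos-style) routine with no intercept, and the function names `center_columns`, `train_linear_svm` and `predict` are hypothetical. Ranking features by absolute weight is shown only as one simple way a linear SVM can point to discriminant proteins.

```python
from typing import List, Sequence, Tuple


def center_columns(X: Sequence[Sequence[float]]) -> Tuple[List[List[float]], List[float]]:
    """Subtract each column mean so that a model without intercept is reasonable."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    centered = [[row[j] - means[j] for j in range(len(means))] for row in X]
    return centered, means


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float = 0.01, epochs: int = 50) -> List[float]:
    """Minimize hinge loss + (lam/2)*||w||^2 by sub-gradient steps (Pegasos-style).

    Assumptions: labels are +1/-1, features are centered, no intercept term.
    """
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                      # decreasing step size
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [wj * (1.0 - eta * lam) for wj in w]   # shrink: L2 penalty
            if margin < 1.0:                           # sample violates the margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def predict(w: Sequence[float], means: Sequence[float], sample: Sequence[float]) -> int:
    """Classify a raw-count sample after centering it with the training means."""
    score = sum(wj * (xj - mj) for wj, xj, mj in zip(w, sample, means))
    return 1 if score >= 0 else -1


# Toy spectral counts (invented): 6 samples x 3 proteins.
counts = [
    [10, 1, 5], [9, 2, 5], [11, 0, 5],   # phenotype A (+1)
    [1, 10, 5], [2, 9, 5], [0, 11, 5],   # phenotype B (-1)
]
labels = [1, 1, 1, -1, -1, -1]

X, means = center_columns(counts)
w = train_linear_svm(X, labels)
train_pred = [predict(w, means, s) for s in counts]

# Proteins ordered from most to least discriminant by absolute weight.
ranking = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
```

In this toy setting the third protein has identical counts in every sample, so it receives zero weight and falls to the end of the ranking. With real MudPIT data, a validated SVM library, cross-validation and appropriate normalization would be needed before drawing any conclusion.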