Mining Large Scale Tandem Mass Spectrometry Data for Protein Modifications Using Spectral Libraries

J Proteome Res. 2016 Mar 4;15(3):721-31. doi: 10.1021/acs.jproteome.5b00877. Epub 2015 Dec 31.

Abstract

Experimental improvements in post-translational modification (PTM) detection by tandem mass spectrometry (MS/MS) have allowed the identification of vast numbers of PTMs. Open modification searches (OMSs) of MS/MS data, which do not require prior knowledge of the modifications present in the sample, have further increased the diversity of detected PTMs. Despite much effort, functional annotation of PTMs is still lacking. One way to narrow this annotation gap is to mine MS/MS data deposited in public repositories and to correlate the presence of PTMs with the biological meta-information attached to the data. Because these data sets can be substantial, containing tens of millions of MS/MS spectra, data mining tools must be able to cope with big data. Here, we present two tools, Liberator and MzMod, which are built using the MzJava class library and the Apache Spark large-scale computing framework. Liberator builds large MS/MS spectrum libraries, and MzMod searches them in an OMS mode. We applied these tools to a recently published set of 25 million spectra from 30 human tissues and present tissue-specific PTMs. We also compared the results with those obtained using the OMS tool MODa and the search engine X!Tandem.

Keywords: Apache Spark; Hadoop; MS/MS; PTM; big data; human tissues; open modification search; parallel computing; proteomics.

MeSH terms

  • Data Mining / methods*
  • Databases, Protein*
  • Humans
  • Protein Processing, Post-Translational*
  • Proteomics / methods
  • Search Engine
  • Software
  • Tandem Mass Spectrometry / methods