An optimized data structure for high-throughput 3D proteomics data: mzRTree

Sara Nasso; Francesco Silvestri; Francesco Tisiot; Barbara Di Camillo; Andrea Pietracaprina; Gianna Maria Toffolo

doi:10.1016/j.jprot.2010.02.006

J Proteomics. 2010 Apr 18;73(6):1176-82. doi: 10.1016/j.jprot.2010.02.006. Epub 2010 Feb 16.

Authors

Sara Nasso¹, Francesco Silvestri, Francesco Tisiot, Barbara Di Camillo, Andrea Pietracaprina, Gianna Maria Toffolo

Affiliation

¹ Department of Information Engineering, University of Padova, Padova, Italy. sara.nasso@dei.unipd.it

PMID: 20167298

Abstract

As an emerging field, MS-based proteomics still requires software tools for efficiently storing and accessing experimental data. In this work, we focus on the management of LC-MS data, which are typically made available in standard XML-based portable formats. The structures that are currently employed to manage these data can be highly inefficient, especially when dealing with high-throughput profile data. LC-MS datasets are usually accessed through 2D range queries. Optimizing this type of operation could dramatically reduce the complexity of data analysis. We propose a novel data structure for LC-MS datasets, called mzRTree, which embodies a scalable index based on the R-tree data structure. mzRTree can be efficiently created from the XML-based data formats and it is suitable for handling very large datasets. We experimentally show that, on all range queries, mzRTree outperforms other known structures used for LC-MS data, even on the queries for which those structures are optimized. mzRTree is also more space-efficient. As a result, mzRTree reduces data analysis computational costs for very large profile datasets.
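To make the idea of a 2D range query over LC-MS data concrete, the following is a minimal sketch of an R-tree-style index over (retention time, m/z) points. It is not the mzRTree implementation: the names `Node`, `build`, and `range_query`, the tuple layouts, and the simple sort-and-pack bulk loading are all assumptions chosen for illustration. The sketch shows only the general principle that bounding boxes let a query skip whole groups of points.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical layouts, chosen for this sketch only.
Point = Tuple[float, float, float]       # (rt, mz, intensity)
Box = Tuple[float, float, float, float]  # (rt_min, rt_max, mz_min, mz_max)


@dataclass
class Node:
    box: Box
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)  # non-empty only in leaves


def box_of_points(points: List[Point]) -> Box:
    rts = [p[0] for p in points]
    mzs = [p[1] for p in points]
    return (min(rts), max(rts), min(mzs), max(mzs))


def merge_boxes(boxes: List[Box]) -> Box:
    return (min(b[0] for b in boxes), max(b[1] for b in boxes),
            min(b[2] for b in boxes), max(b[3] for b in boxes))


def build(points: List[Point], leaf_size: int = 4, fanout: int = 4) -> Node:
    """Bulk-load a packed tree: sort by (rt, mz), cut into leaves, group upward."""
    if not points:
        raise ValueError("cannot index an empty dataset")
    if leaf_size < 1 or fanout < 2:
        raise ValueError("leaf_size must be >= 1 and fanout >= 2")
    ordered = sorted(points)
    level = [Node(box=box_of_points(ordered[i:i + leaf_size]),
                  points=ordered[i:i + leaf_size])
             for i in range(0, len(ordered), leaf_size)]
    while len(level) > 1:
        level = [Node(box=merge_boxes([n.box for n in level[i:i + fanout]]),
                      children=level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


def intersects(box: Box, query: Box) -> bool:
    return (box[0] <= query[1] and query[0] <= box[1]
            and box[2] <= query[3] and query[2] <= box[3])


def range_query(node: Node, query: Box) -> List[Point]:
    """Return all points with rt and mz inside the (inclusive) query box."""
    if not intersects(node.box, query):
        return []  # prune: nothing under this node can match
    if node.points:  # leaf: test each stored point
        return [p for p in node.points
                if query[0] <= p[0] <= query[1] and query[2] <= p[1] <= query[3]]
    hits: List[Point] = []
    for child in node.children:
        hits.extend(range_query(child, query))
    return hits


# Synthetic data: 10 scans (rt = 0..9), 10 m/z samples each, spaced 0.5 apart.
points = [(float(rt), 400.0 + 0.5 * k, float(rt * 10 + k))
          for rt in range(10) for k in range(10)]
tree = build(points)
hits = range_query(tree, (2.0, 4.0, 401.0, 402.0))
print(len(hits))
```

A query box that covers a full m/z range for one retention time corresponds to extracting a spectrum, while a narrow m/z band over a long retention-time interval corresponds to extracting a chromatogram. Both reduce to the same `range_query` call in this sketch.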

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Chromatography, Liquid / methods
Computational Biology / methods*
Humans
Imaging, Three-Dimensional
Mass Spectrometry / methods
Programming Languages
Proteome
Proteomics / methods*
Reproducibility of Results
Software
Time Factors

Substances

Proteome