Statistical analysis of proteomics data: A review on feature selection

Marta Lualdi; Mauro Fasano

doi:10.1016/j.jprot.2018.12.004

J Proteomics. 2019 Apr 30:198:18-26. doi: 10.1016/j.jprot.2018.12.004. Epub 2018 Dec 6.

Authors

Marta Lualdi¹, Mauro Fasano²

Affiliations

¹ Department of Science and High Technology (DiSAT), University of Insubria, Busto Arsizio, Italy.
² Department of Science and High Technology (DiSAT), University of Insubria, Busto Arsizio, Italy. Electronic address: mauro.fasano@uninsubria.it.

PMID: 30529743

Abstract

The spread of "-omics" strategies has profoundly changed how the scientific method is conceived. Indeed, managing huge amounts of data requires replacing the classical deductive approach with a data-driven inductive approach, so as to generate mechanistic hypotheses from data. Data reduction is a crucial step in proteomics data analysis because significant features are sparse in big datasets. Thus, feature selection methods are applied to obtain a set of features from which a proteomics signature with functional significance (e.g., classification, diagnosis, prognosis) can be drawn. In this context, the aim of the present review article is to give an overview of the methods available for proteomics data analysis, with a focus on biomedical translational research. Suggestions are presented for choosing the most appropriate standard statistical procedures to perform data reduction by feature selection, cross-validation and functional analysis of proteomics profiles.

SIGNIFICANCE: The proteome, including all so-called "proteoforms", represents the highest level of biomolecular complexity when compared to the other "-omes" (i.e., genome, transcriptome). For this reason, the use of proper data reduction strategies is mandatory for proteomics data analysis. However, the strategies employed for feature selection must be chosen carefully, since many different approaches exist, depending on both the input data and the desired output. So far, a well-established decision-making workflow for proteomics data analysis is lacking, which leaves room for misleading and incorrect data analysis and interpretation. In this review article, many statistical approaches are described and compared for their application in biomedical research, in order to suggest to the reader the most suitable analysis pathway and to help avoid mistakes.
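
As an illustration of how feature selection and cross-validation can be combined, the following minimal Python sketch ranks features by a Welch t-statistic and evaluates a nearest-centroid classifier with k-fold cross-validation. It is not taken from the review: the synthetic data, the choice of filter and classifier, and all function names (welch_t, select_features, nearest_centroid_predict, cross_validated_accuracy, make_toy_data) are assumptions made for this example. The one point it is meant to show is that selection is repeated inside each training fold, so the held-out samples never influence which features are kept.

```python
import random
import statistics


def welch_t(a, b):
    """Absolute Welch t-statistic between two samples of one feature."""
    va, vb = statistics.variance(a), statistics.variance(b)
    denom = (va / len(a) + vb / len(b)) ** 0.5
    if denom == 0:
        return 0.0
    return abs(statistics.mean(a) - statistics.mean(b)) / denom


def select_features(X, y, k):
    """Rank features by |t| between classes 0 and 1; return top-k indices."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((welch_t(a, b), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid on the selected features is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [statistics.mean(r[j] for r in rows) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_validated_accuracy(X, y, k_features=5, n_folds=5, seed=0):
    """k-fold CV in which feature selection sees only the training fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Selection is redone per fold to avoid leaking test information.
        feats = select_features(X_tr, y_tr, k_features)
        for i in test_idx:
            predicted = nearest_centroid_predict(X_tr, y_tr, X[i], feats)
            correct += predicted == y[i]
    return correct / len(X)


def make_toy_data(n_samples=40, n_features=200, n_informative=5,
                  shift=3.0, seed=1):
    """Synthetic stand-in for a sparse dataset: few informative features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += shift * label  # only the first columns carry signal
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
accuracy = cross_validated_accuracy(X, y)
selected = select_features(X, y, 5)
print(f"cross-validated accuracy: {accuracy:.2f}")
print(f"features selected on the full data: {sorted(selected)}")
```

Selecting features once on the full dataset and cross-validating only the classifier would tend to give an optimistically biased accuracy estimate, which is why the selection step sits inside the loop in this sketch. Real proteomics data would also need steps not shown here, such as normalization, missing-value handling and multiple-testing correction.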

Keywords: Dimensionality and Sparsity; Feature selection; Inductive reasoning; Proteomics signature.

Publication types

Review

MeSH terms

Animals
Data Interpretation, Statistical
Electronic Data Processing* / methods
Electronic Data Processing* / trends
Humans
Proteomics* / methods
Proteomics* / trends
Translational Research, Biomedical* / methods
Translational Research, Biomedical* / trends