Features Selection and Extraction in Statistical Analysis of Proteomics Datasets

Marta Lualdi; Mauro Fasano

doi:10.1007/978-1-0716-1641-3_9

Methods Mol Biol. 2021:2361:143-159. doi: 10.1007/978-1-0716-1641-3_9.

Authors

Marta Lualdi¹, Mauro Fasano²

Affiliations

¹ Department of Science and High Technology, Center of Bioinformatics, University of Insubria, Busto Arsizio, Italy.
² Department of Science and High Technology, Center of Bioinformatics, University of Insubria, Busto Arsizio, Italy. mauro.fasano@uninsubria.it.

PMID: 34236660

Abstract

"Omics" techniques (e.g., proteomics, genomics, metabolomics) now produce huge datasets and require a different way of thinking about data analysis, summarized by the idea that, when there are enough data, they can speak for themselves. Indeed, managing huge amounts of data requires replacing the classical hypothesis-driven deductive approach with a data-driven, hypothesis-generating inductive approach, so as to generate mechanistic hypotheses from data. Data reduction is a crucial step in proteomics data analysis because significant features are sparse in big datasets. Thus, feature selection/extraction methods are applied to obtain a set of features from which a proteomics signature with functional significance (e.g., classification, diagnosis, prognosis) can be drawn. Although proteomics studies generate big data almost daily, a well-established statistical workflow for data analysis in proteomics is still lacking, which leaves room for misleading and incorrect data analysis and interpretation. This chapter gives an overview of the methods available for feature selection/extraction in proteomics datasets and of how to choose the most appropriate one based on the type of dataset.

Keywords: Cross-validation; Discriminant analysis; Features extraction; Features selection; Principal component analysis; Proteomics; Signature; Sparsity; Supervised/unsupervised methods; Univariate/multivariate methods.

MeSH terms

Genomics
Metabolomics
Proteomics*
Research Design
Workflow