Features Selection and Extraction in Statistical Analysis of Proteomics Datasets

Methods Mol Biol. 2021:2361:143-159. doi: 10.1007/978-1-0716-1641-3_9.

Abstract

"Omics" techniques (e.g., proteomics, genomics, metabolomics), which can now generate huge datasets, require a different way of thinking about data analysis, summarized by the idea that, when there are enough data, they can speak for themselves. Indeed, managing huge amounts of data requires replacing the classical hypothesis-driven deductive approach with a data-driven, hypothesis-generating inductive approach, so as to generate mechanistic hypotheses from the data. Data reduction is a crucial step in proteomics data analysis because significant features are sparse in big datasets. Thus, feature selection/extraction methods are applied to obtain a set of features from which a proteomics signature with functional significance (e.g., for classification, diagnosis, or prognosis) can be drawn. Although proteomics studies generate big data almost daily, a well-established statistical workflow for proteomics data analysis is still lacking, which leaves room for misleading and incorrect data analysis and interpretation. This chapter gives an overview of the methods available for feature selection/extraction in proteomics datasets and of how to choose the most appropriate one based on the type of dataset.
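The distinction between feature selection and feature extraction named in the abstract can be illustrated with a minimal sketch. It is not taken from the chapter: the synthetic data, the function names (make_synthetic, welch_t, select_top_k, pca_scores), and all parameter values are assumptions chosen for illustration. Selection here is a supervised, univariate ranking by Welch's t-statistic that keeps a subset of the original features; extraction is an unsupervised, multivariate principal component analysis that builds new features as linear combinations of all of them.

```python
import numpy as np


def make_synthetic(n_per_group=20, n_features=200, n_informative=5,
                   shift=4.0, seed=0):
    """Hypothetical sparse dataset: two groups, few informative features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_group, n_features))
    y = np.repeat([0, 1], n_per_group)
    # Only the first n_informative columns differ between the two groups.
    X[y == 1, :n_informative] += shift
    return X, y


def welch_t(X, y):
    """Per-feature Welch t-statistic between group 1 and group 0."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a)
                 + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se


def select_top_k(X, y, k):
    """Feature selection: indices of the k features with largest |t|."""
    t = welch_t(X, y)
    return np.argsort(-np.abs(t))[:k]


def pca_scores(X, n_components):
    """Feature extraction: PCA scores via SVD of the column-centered data."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return scores, explained[:n_components]


X, y = make_synthetic()
selected = select_top_k(X, y, k=5)                  # subset of original features
scores, explained = pca_scores(X, n_components=2)   # new, derived features
```

In a real analysis, a supervised selection step like this would normally be repeated inside each cross-validation fold rather than run once on the full dataset, since selecting features with all labels visible tends to give optimistic performance estimates.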

Keywords: Cross-validation; Discriminant analysis; Features extraction; Features selection; Principal component analysis; Proteomics; Signature; Sparsity; Supervised/unsupervised methods; Univariate/multivariate methods.

MeSH terms

  • Genomics
  • Metabolomics
  • Proteomics*
  • Research Design
  • Workflow