CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research

J Comput Biol. 2019 Apr;26(4):376-386. doi: 10.1089/cmb.2018.0238. Epub 2019 Feb 21.

Abstract

The use of machine learning (ML) approaches to extract gene expression information from microarray studies has increased in recent years, especially in cancer-related work. However, despite this continuing interest in applying ML to cancer biomedical research, there are no curated repositories dedicated to providing quality data sets exclusively for benchmarking and testing such techniques. Thus, in this work, we present the Curated Microarray Database (CuMiDa), a database composed of 78 handpicked microarray data sets for Homo sapiens that were carefully selected from more than 30,000 microarray experiments from the Gene Expression Omnibus using rigorous filtering criteria. All data sets were individually submitted to background correction, normalization, and sample quality analysis, and were manually edited to eliminate erroneous probes. All data sets were tested using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to observe sample division and were additionally tested using various ML approaches to provide a baseline accuracy for the major techniques employed for microarray data sets. CuMiDa is a database created solely for benchmarking and testing of ML approaches applied to cancer research.
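To make the idea of a per-data-set baseline accuracy concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a synthetic two-class expression matrix (the helper `make_toy_dataset` and its parameters are hypothetical), a nearest-centroid classifier chosen only for its simplicity, and leave-one-out cross-validation as the evaluation scheme. None of these choices are stated in the abstract.

```python
import random


def make_toy_dataset(n_per_class=20, n_genes=50, shift=2.0, seed=0):
    """Hypothetical stand-in for an expression matrix.

    Rows are samples and columns are genes. This is synthetic data,
    not a CuMiDa data set.
    """
    rng = random.Random(seed)
    samples, labels = [], []
    for label in ("normal", "tumor"):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == "tumor":
                # Shift the first five genes to mimic differential expression.
                for g in range(5):
                    row[g] += shift
            samples.append(row)
            labels.append(label)
    return samples, labels


def centroid(rows):
    """Per-gene mean across a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_predict(train_x, train_y, query):
    """Assign the query sample to the class with the closest centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        c = centroid([x for x, y in zip(train_x, train_y) if y == label])
        d = squared_distance(c, query)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


samples, labels = make_toy_dataset()
accuracy = leave_one_out_accuracy(samples, labels)
print(f"Leave-one-out accuracy: {accuracy:.2f}")
```

In a benchmarking setting, a score like this for a simple classifier serves as the reference point against which a new method on the same data set is compared. A real evaluation would load an actual curated data set and would likely use several classifiers and a cross-validation scheme suited to the sample size.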

Keywords: benchmarking; cancer; classification; curation; machine learning; microarray; supervised learning; unsupervised learning.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Benchmarking
  • Computational Biology / methods
  • Data Curation / methods*
  • Gene Expression Profiling / methods*
  • Humans
  • Neoplasms / genetics*
  • Oligonucleotide Array Sequence Analysis
  • Principal Component Analysis
  • Unsupervised Machine Learning