CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research

Bruno César Feltes; Eduardo Bassani Chandelier; Bruno Iochins Grisci; Márcio Dorn

doi:10.1089/cmb.2018.0238

J Comput Biol. 2019 Apr;26(4):376-386. doi: 10.1089/cmb.2018.0238. Epub 2019 Feb 21.

Authors

Bruno César Feltes¹, Eduardo Bassani Chandelier¹, Bruno Iochins Grisci¹, Márcio Dorn¹

Affiliation

¹ Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil.

PMID: 30789283
DOI: 10.1089/cmb.2018.0238

Abstract

The use of machine learning (ML) approaches to extract gene expression information from microarray studies has increased in recent years, especially in cancer-related work. However, despite this continued interest in applying ML to cancer biomedical research, there are no curated repositories dedicated to providing quality data sets for benchmarking and testing such techniques. Thus, in this work, we present the Curated Microarray Database (CuMiDa), a database composed of 78 handpicked microarray data sets for Homo sapiens that were carefully selected from more than 30,000 microarray experiments in the Gene Expression Omnibus using rigorous filtering criteria. All data sets were individually submitted to background correction, normalization, and sample quality analysis, and were manually edited to eliminate erroneous probes. All data sets were examined using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to observe sample division, and were additionally tested using various ML approaches to provide a base accuracy for the major techniques employed for microarray data sets. CuMiDa is a database created solely for benchmarking and testing of ML approaches applied to cancer research.

Keywords: benchmarking; cancer; classification; curation; machine learning; microarray; supervised learning; unsupervised learning.
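
As a rough illustration of what a "base accuracy" for a data set could mean in practice, the following minimal sketch estimates a cross-validated accuracy for a simple classifier on a samples-by-genes matrix. Everything in it is an assumption made for illustration: the data are synthetic rather than drawn from CuMiDa, the nearest-centroid classifier and the 5-fold split are stand-ins rather than the methods used by the authors, and all function names are hypothetical.

```python
import random


def make_synthetic_expression(n_per_class=20, n_genes=50, n_informative=5, seed=0):
    """Build a toy two-class samples-by-genes matrix (synthetic, not CuMiDa data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                # Shift a few "genes" in class 1 to mimic differential expression.
                for g in range(n_informative):
                    row[g] += 3.0
            X.append(row)
            y.append(label)
    return X, y


def centroid(rows):
    """Per-gene mean over a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closest training centroid."""
    centroids = {
        label: centroid([x for x, t in zip(X_train, y_train) if t == label])
        for label in sorted(set(y_train))
    }
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X_test]


def kfold_accuracy(X, y, k=5, seed=0):
    """Plain (non-stratified) k-fold cross-validated accuracy."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        X_tr = [X[i] for i in idx if i not in held_out]
        y_tr = [y[i] for i in idx if i not in held_out]
        preds = nearest_centroid_predict(X_tr, y_tr, [X[i] for i in fold])
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)


X, y = make_synthetic_expression()
baseline = kfold_accuracy(X, y)
print(f"5-fold baseline accuracy: {baseline:.2f}")
```

With a real data set, the synthetic matrix would be replaced by the preprocessed expression values and class labels, and a stratified split would likely be preferable when classes are imbalanced.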

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Benchmarking
Computational Biology / methods
Data Curation / methods*
Gene Expression Profiling / methods*
Humans
Neoplasms / genetics*
Oligonucleotide Array Sequence Analysis
Principal Component Analysis
Unsupervised Machine Learning