A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis

Eunjee Lee; Seungyeul Yoo; Wenhui Wang; Zhidong Tu; Jun Zhu

doi:10.1093/gigascience/giz080

Gigascience. 2019 Jul 1;8(7):giz080. doi: 10.1093/gigascience/giz080.

Authors

Eunjee Lee¹ ² ³, Seungyeul Yoo¹ ², Wenhui Wang¹ ², Zhidong Tu¹ ², Jun Zhu¹ ² ³ ⁴

Affiliations

¹ Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA.
² Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA.
³ Sema4, a Mount Sinai venture, 333 Ludlow street, Stamford, CT 06902, USA.
⁴ The Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA.

Abstract

Background: Data errors, including sample swapping and mislabeling, are inevitable in the process of large-scale omics data generation. Data errors need to be identified and corrected before integrative data analyses where different types of data are merged on the basis of the annotated labels. Data with labeling errors dampen true biological signals. More importantly, data analysis with sample errors could lead to wrong scientific conclusions. We developed a robust probabilistic multi-omics data matching procedure, proMODMatcher, to curate data and to identify and correct data annotation errors in large databases.

Results: Application to simulated datasets suggests that proMODMatcher achieved robust statistical power even when the number of cis-associations was small and/or the number of samples was large. Application of proMODMatcher to multi-omics datasets in The Cancer Genome Atlas and the International Cancer Genome Consortium identified sample errors in multiple cancer datasets. Our procedure was able not only to identify sample-labeling errors but also to unambiguously identify the source of the errors. Our results demonstrate that these errors should be identified and corrected before integrative analysis.

Conclusions: Our results indicate that sample-labeling errors were common in large multi-omics datasets. These errors should be corrected before integrative analysis.
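Illustration: the abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea of similarity-based sample matching between two data types, not the proMODMatcher procedure. It assumes two simulated matrices whose features are paired one-to-one (standing in for cis-associated features), swaps two sample labels in one of them, and flags samples whose best-matching profile in the other data type carries a different label. All names (`zscore_columns`, `best_matches`, `type_a`, `type_b`) and all simulation settings are hypothetical.

```python
import random
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Standardize each feature (column) across samples (rows)."""
    scaled_cols = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(v - mu) / sd for v in col])
    # Transpose back to a samples-by-features layout.
    return [list(row) for row in zip(*scaled_cols)]


def best_matches(type_a, type_b):
    """For each sample in type_a, return the index of the most similar sample in type_b.

    Similarity is the mean product of standardized values over paired features.
    """
    za, zb = zscore_columns(type_a), zscore_columns(type_b)
    n_features = len(za[0])
    matches = []
    for row_a in za:
        scores = [
            sum(x * y for x, y in zip(row_a, row_b)) / n_features
            for row_b in zb
        ]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches


# Simulated data (assumption): 20 samples, 200 strongly paired features.
rng = random.Random(0)
n_samples, n_features = 20, 200
type_a = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
type_b = [[v + rng.gauss(0, 0.3) for v in row] for row in type_a]

# Simulate a labeling error: swap samples 3 and 7 in the second data type.
type_b[3], type_b[7] = type_b[7], type_b[3]

matches = best_matches(type_a, type_b)
flagged = [i for i, j in enumerate(matches) if i != j]
print("samples whose best match has a different label:", flagged)
```

With a signal this strong, the swapped pair is expected to be the only one flagged, and each flagged sample's best match points to its true partner. Real data have far weaker and sparser cross-data-type associations, which is the setting a probabilistic method such as the one described here is meant to address.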

Keywords: data curation; data error; omics data integration.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Data Accuracy
Databases, Genetic / standards*
Genome, Human
Genomics / methods*
Genomics / standards
Humans
Neoplasms / genetics*
Probability
Software*
Transcriptome
