Machine learning-based tissue of origin classification for cancer of unknown primary diagnostics using genome-wide mutation features

Nat Commun. 2022 Jul 11;13(1):4013. doi: 10.1038/s41467-022-31666-w.

Abstract

Cancers of unknown primary (CUP), in which the tumor tissue of origin (TOO) cannot be determined, account for ∼3% of all cancer diagnoses. Using a uniformly processed dataset encompassing 6756 whole-genome sequenced primary and metastatic tumors, we develop Cancer of Unknown Primary Location Resolver (CUPLR), a random forest TOO classifier that employs 511 features based on simple and complex somatic driver and passenger mutations. CUPLR distinguishes 35 cancer (sub)types with ∼90% recall and ∼90% precision based on cross-validation and test set predictions. We find that structural variant-derived features improve performance and utility when classifying specific cancer types. With CUPLR, we could determine the TOO for 82/141 (58%) of CUP patients. Although CUPLR is based on machine learning, it provides a human-interpretable graphical report with detailed feature explanations. The comprehensive output of CUPLR complements existing histopathological procedures and can enable improved diagnostics for CUP patients.
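The recall and precision figures quoted above are per-class classification metrics. As a minimal sketch of how such figures can be computed from predicted and true tissue labels, the Python below uses invented toy labels and hypothetical helper names (`per_class_precision_recall`, `macro_average`); it is not the CUPLR code, and the use of macro-averaging across classes is an assumption, since the abstract does not state how the values were aggregated. The last line only re-checks the 82/141 proportion reported in the abstract.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)} for paired label lists."""
    true_pos = Counter()    # correct predictions per class
    pred_count = Counter()  # times each class was predicted
    true_count = Counter()  # times each class is the true label
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            true_pos[t] += 1
    metrics = {}
    for c in sorted(set(true_count) | set(pred_count)):
        precision = true_pos[c] / pred_count[c] if pred_count[c] else 0.0
        recall = true_pos[c] / true_count[c] if true_count[c] else 0.0
        metrics[c] = (precision, recall)
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision and recall over classes (an assumption)."""
    n = len(metrics)
    mean_precision = sum(p for p, _ in metrics.values()) / n
    mean_recall = sum(r for _, r in metrics.values()) / n
    return mean_precision, mean_recall


# Toy labels invented for illustration; not data from the study.
y_true = ["Breast", "Breast", "Lung", "Lung", "Colorectal", "Colorectal"]
y_pred = ["Breast", "Lung", "Lung", "Lung", "Colorectal", "Breast"]

metrics = per_class_precision_recall(y_true, y_pred)
macro_precision, macro_recall = macro_average(metrics)

# Proportion of CUP patients with a resolved TOO, as reported in the abstract.
resolved_percent = round(100 * 82 / 141)
```

In this toy case a class can score high on one metric and low on the other, which is why both are reported: precision penalizes false assignments to a cancer type, while recall penalizes missed cases of that type.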

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Genome
  • Humans
  • Machine Learning
  • Mutation
  • Neoplasms, Unknown Primary* / diagnosis
  • Neoplasms, Unknown Primary* / genetics