A semiparametric kernel independence test with application to mutational signatures

J Am Stat Assoc. 2021;116(536):1648-1661. doi: 10.1080/01621459.2020.1871357. Epub 2021 Feb 16.

Abstract

Cancers arise from somatic mutations, and characteristic combinations of somatic mutations form mutational signatures. Although many mutational signatures have been identified, the mutational processes underlying a number of them remain unknown, which hinders the identification of interventions that may reduce somatic mutation burden and prevent the development of cancer. We demonstrate that the unknown cause of a mutational signature can be inferred from associated signatures with known etiology. However, existing association tests lack statistical power because of excess zeros in mutational signature data. To address this limitation, we propose a semiparametric kernel independence test (SKIT). The SKIT statistic is defined as the integrated squared distance between mixed probability distributions and is decomposed into four disjoint components to pinpoint the source of dependency. We derive the asymptotic null distribution and prove the asymptotic convergence of power. Because convergence to the asymptotic null distribution is slow, a bootstrap method is employed to compute p-values. Simulation studies demonstrate that when zeros are prevalent, SKIT is more resilient to power loss than existing tests and is robust to random errors. We applied SKIT to The Cancer Genome Atlas (TCGA) mutational signatures data for over 9,000 tumors across 32 cancer types and identified a novel association between signature 17, curated in the Catalogue Of Somatic Mutations In Cancer (COSMIC), and apolipoprotein B mRNA editing enzyme (APOBEC) signatures in gastrointestinal cancers. This finding indicates that APOBEC activity is likely associated with the unknown cause of signature 17.

Keywords: Excess zeros; Mutational signature; Rosenblatt-Parzen kernel estimator; Test of independence.
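
The abstract describes computing p-values by bootstrap for a dependence statistic on data with many exact zeros. The sketch below illustrates that general workflow only. It is not the SKIT statistic: the function `zero_inflated_stat` is a hypothetical toy that adds a discrete zero/zero term to a correlation term over jointly non-zero pairs, and the resampling scheme in `bootstrap_pvalue` (drawing each variable independently with replacement to mimic the null of independence) is an assumption, since the abstract does not specify the authors' bootstrap procedure.

```python
import numpy as np


def zero_inflated_stat(x, y):
    """Toy dependence statistic for zero-inflated pairs (NOT the SKIT statistic)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x0, y0 = (x == 0), (y == 0)

    # Discrete part: squared gap between the joint zero probability
    # and the product of the marginal zero probabilities.
    zero_part = (np.mean(x0 & y0) - np.mean(x0) * np.mean(y0)) ** 2

    # Continuous part: squared Pearson correlation among pairs where both
    # values are non-zero, weighted by the share of such pairs.
    both = ~x0 & ~y0
    cont_part = 0.0
    if both.sum() >= 3:
        xs, ys = x[both], y[both]
        if xs.std() > 0 and ys.std() > 0:
            r = np.corrcoef(xs, ys)[0, 1]
            cont_part = np.mean(both) * r ** 2

    return zero_part + cont_part


def bootstrap_pvalue(x, y, n_boot=500, seed=0):
    """Bootstrap p-value under independence (assumed resampling scheme)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    observed = zero_inflated_stat(x, y)

    exceed = 0
    for _ in range(n_boot):
        # Resample x and y separately so any pairing between them is broken.
        xb = x[rng.integers(0, n, n)]
        yb = y[rng.integers(0, n, n)]
        if zero_inflated_stat(xb, yb) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_boot + 1)
```

Calling `bootstrap_pvalue(x, y)` on two equal-length arrays of non-negative signature exposures returns a value in (0, 1]; small values suggest dependence under this toy statistic. Keeping the discrete and continuous terms separate loosely mirrors the idea of attributing dependence to distinct zero/non-zero components, although the paper's decomposition has four components and is kernel-based.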