Cancer classification based on multiple dimensions: SNV patterns

Comput Biol Med. 2022 Dec;151(Pt A):106270. doi: 10.1016/j.compbiomed.2022.106270. Epub 2022 Nov 11.

Abstract

Background: The occurrence of cancer is closely related to single nucleotide variants (SNVs). However, SNVs are detected in different patterns in DNA samples collected from patients with distinct cancers. Selecting an appropriate method to classify cancers by exploiting SNV patterns as fully as possible is therefore an important task that can aid cancer diagnosis and treatment. In traditional studies, researchers combined each SNV with its neighboring nucleotides to form a trinucleotide, and mutation signatures for cancer classification were extracted from the trinucleotide patterns. However, extracting SNV features in a single dimension may result in partial information loss and poor model performance.
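
The traditional trinucleotide encoding mentioned above can be illustrated with a minimal sketch. It is not taken from the paper: the function name `trinucleotide_context`, the 0-based position, and the label format `A[C>T]G` are hypothetical choices, and the folding of purine references onto the pyrimidine strand is a convention common in mutational-signature work that the abstract does not state.

```python
from collections import Counter

# Watson-Crick complements, used to fold purine references onto the pyrimidine strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def trinucleotide_context(sequence, position, alt):
    """Label a substitution with its two flanking bases, e.g. 'A[C>T]G'.

    `position` is 0-based (an assumption of this sketch) and must have one
    base on each side within `sequence`.
    """
    if position < 1 or position > len(sequence) - 2:
        raise ValueError("position needs one flanking base on each side")
    left, ref, right = sequence[position - 1], sequence[position], sequence[position + 1]
    if ref == alt:
        raise ValueError("alt must differ from the reference base")
    # Assumed convention: report the change with a pyrimidine (C or T) reference.
    # A purine reference is replaced by its reverse complement.
    if ref in "AG":
        left, ref, right, alt = (
            COMPLEMENT[right],
            COMPLEMENT[ref],
            COMPLEMENT[left],
            COMPLEMENT[alt],
        )
    return f"{left}[{ref}>{alt}]{right}"


def context_counts(sequence, variants):
    """Count trinucleotide labels for (position, alt) pairs in one sample."""
    return Counter(trinucleotide_context(sequence, pos, alt) for pos, alt in variants)


# Toy usage on a made-up sequence: both variants fold to the same label.
toy_counts = context_counts("ACGTA", [(1, "T"), (2, "A")])
```

A per-sample vector of such counts is the kind of single-dimension feature the Background describes. Widening the window to include second-order neighbors would be one way to move toward the multidimensional features proposed in the Results, although the abstract does not specify how the six feature types are built.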

Results: In this study, we defined multidimensional SNV (M-SNV) features to classify cancer. M-SNV features consider the first- and second-order neighboring nucleotides of one-dimensional SNVs and include six types of features. We validated the feasibility of M-SNV features using a dataset obtained from The Cancer Genome Atlas (TCGA) consisting of 2761 samples from 12 cancers. We performed preliminary screening of 562,321 DNA mutation sites in these samples. The remaining mutation sites were characterized by cancer type in six signatures. We found that the extracted features showed similar distributions near the cluster center of each sample's cancer type. After preprocessing of the raw data, samples were more concentrated around the cancer subtype distributions at the SNV level. We used k-nearest neighbors (KNN) to classify the extracted features and employed leave-one-out cross-validation to verify the results. Classification accuracy was stable at approximately 97% and reached 97.43% in the best case. Furthermore, we found that the validated oncogenes in the loci of the features had the highest importance among the 8 cancers.
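
The evaluation scheme named above, KNN scored by leave-one-out cross-validation, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the abstract does not give the distance metric or the value of k, so Euclidean distance and k = 3 are assumed here, and the function names, toy feature vectors, and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    Euclidean distance is an assumption of this sketch.
    """
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(features, labels, k=3):
    """Hold out each sample once, classify it from the rest, and average."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, features[i], k) == labels[i]:
            correct += 1
    return correct / len(features)


# Made-up two-dimensional feature vectors for two hypothetical cancer types.
toy_features = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [1.0, 1.1], [1.1, 1.0], [1.0, 1.0],
]
toy_labels = ["type_1", "type_1", "type_1", "type_2", "type_2", "type_2"]
toy_accuracy = leave_one_out_accuracy(toy_features, toy_labels, k=3)
```

Leave-one-out uses every sample as a test case exactly once, which suits datasets of a few thousand samples where holding out a large test split would be costly. For data on the scale reported here, a library implementation would likely be preferable to this brute-force loop.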

Conclusions: It is feasible to classify cancers by the distribution of features we defined. Moreover, our methodology has potential implications for the discovery of oncogenes.

Keywords: Cancer classification; KNN; Multidimensional SNV feature; Oncogene.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cluster Analysis
  • Humans
  • Mutation
  • Neoplasms* / genetics
  • Nucleotides
  • Oncogenes*

Substances

  • Nucleotides