Functional and embedding feature analysis for pan-cancer classification

Jian Lu; JiaRui Li; Jingxin Ren; Shijian Ding; Zhenbing Zeng; Tao Huang; Yu-Dong Cai

doi:10.3389/fonc.2022.979336

Front Oncol. 2022 Sep 29:12:979336. doi: 10.3389/fonc.2022.979336. eCollection 2022.

Authors

Jian Lu¹,², JiaRui Li³, Jingxin Ren⁴, Shijian Ding⁴, Zhenbing Zeng¹, Tao Huang²,⁵, Yu-Dong Cai⁴

Affiliations

¹ Department of Mathematics, School of Sciences, Shanghai University, Shanghai, China.
² CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.
³ Advanced Research Computing, University of British Columbia, Vancouver, BC, Canada.
⁴ School of Life Sciences, Shanghai University, Shanghai, China.
⁵ CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.

Abstract

Cancer affects a growing number of people and has become a major health problem worldwide. Exploring the biological functions and signaling pathways involved in carcinogenesis is essential for cancer detection and research. In this study, a mutation dataset covering eleven cancer types was first obtained from the web-based resource cBioPortal for Cancer Genomics. A total of 21,049 features were then extracted from three aspects: relationships to GO and KEGG (enrichment features), mutated genes learned by word2vec (text features), and the protein-protein interaction network analyzed by node2vec (network features). Irrelevant features were excluded using the Boruta feature filtering method, and the retained features were ranked by four feature selection methods (least absolute shrinkage and selection operator, minimum redundancy maximum relevance, Monte Carlo feature selection, and light gradient boosting machine) to generate four ranked feature lists. Incremental feature selection was applied to each list to determine the optimal number of features, build the optimal classifiers, and derive interpretable classification rules. The results of the four feature-ranking methods were integrated to identify key functional pathways, such as olfactory transduction (hsa04740) and colorectal cancer (hsa05210), and the roles of these pathways in cancers were discussed with reference to the literature. Overall, this machine learning-based study revealed the altered biological functions of cancers and provided a reference for the mechanisms of different cancers.

Keywords: cancer mutation; embedding; enrichment; feature selection; pan-cancer; rule learning.
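
The incremental feature selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the fixed fold assignment, the toy data, and all function names (`centroid_predict`, `cv_accuracy`, `incremental_feature_selection`) are assumptions chosen only to keep the example self-contained. Given a ranked feature list, the sketch evaluates classifiers on the top 1, 2, ..., n features and keeps the smallest prefix with the best cross-validated accuracy.

```python
import random


def centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1

    def dist(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    # sorted() makes tie-breaking between labels deterministic.
    return min(sorted(sums), key=dist)


def cv_accuracy(X, y, features, k=5):
    """k-fold cross-validated accuracy using only the given feature columns."""
    Xs = [[row[j] for j in features] for row in X]
    correct = 0
    for fold in range(k):
        # Assumption: sample i is assigned to fold i % k (no shuffling).
        train = [i for i in range(len(Xs)) if i % k != fold]
        test = [i for i in range(len(Xs)) if i % k == fold]
        train_X = [Xs[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(centroid_predict(train_X, train_y, Xs[i]) == y[i] for i in test)
    return correct / len(Xs)


def incremental_feature_selection(X, y, ranked, step=1):
    """Evaluate growing prefixes of a ranked feature list.

    Returns the best prefix length, its accuracy, and the full (n, accuracy) curve.
    Ties are resolved in favour of the smaller feature subset.
    """
    curve = []
    best_n, best_acc = 0, -1.0
    for n in range(step, len(ranked) + 1, step):
        acc = cv_accuracy(X, y, ranked[:n])
        curve.append((n, acc))
        if acc > best_acc:
            best_n, best_acc = n, acc
    return best_n, best_acc, curve


# Toy data (hypothetical): feature 0 equals the class label, features 1-4 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(20)]
X = [[float(label)] + [rng.random() for _ in range(4)] for label in y]

# Assumed ranking, standing in for the output of a method such as LASSO or mRMR.
ranked = [0, 1, 2, 3, 4]

best_n, best_acc, curve = incremental_feature_selection(X, y, ranked)
print("best number of features:", best_n, "accuracy:", best_acc)
```

In this toy setting the first-ranked feature separates the two classes on its own, so the procedure stops improving after one feature. In practice the same loop would be run once per ranked list, with a larger step size when the list contains thousands of features.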