The construction of transcriptional risk scores for breast cancer based on lightGBM and multiple omics data

Math Biosci Eng. 2022 Aug 24;19(12):12353-12370. doi: 10.3934/mbe.2022576.

Abstract

Background: Polygenic risk score (PRS) can evaluate the individual-level genetic risk of breast cancer. However, standalone single nucleotide polymorphisms (SNP) data used for PRS may not provide satisfactory prediction accuracy. Additionally, current PRS models based on linear regression have insufficient power to leverage non-linear effects from thousands of associated SNPs. Here, we proposed a transcriptional risk score (TRS) based on multiple omics data to estimate the risk of breast cancer.

Methods: The multiple omics data and clinical data of breast invasive carcinoma (BRCA) were collected from the cancer genome atlas (TCGA) and the gene expression omnibus (GEO). First, we developed a novel TRS model for BRCA utilizing single omic data and LightGBM algorithm. Subsequently, we built a combination model of TRS derived from each omic data to further improve the prediction accuracy. Finally, we performed association analysis and prognosis prediction to evaluate the utility of the TRS generated by our method.

Results: The proposed TRS model achieved better predictive performance than the linear models and other ML methods in single omic dataset. An independent validation dataset also verified the effectiveness of our model. Moreover, the combination of the TRS can efficiently strengthen prediction accuracy. The analysis of prevalence and the associations of the TRS with phenotypes including case-control and cancer stage indicated that the risk of breast cancer increases with the increases of TRS. The survival analysis also suggested that TRS for the cancer stage is an effective prognostic metric of breast cancer patients.

Conclusions: Our proposed TRS model expanded the current definition of PRS from standalone SNP data to multiple omics data and outperformed the linear models, which may provide a powerful tool for diagnostic and prognostic prediction of breast cancer.

Keywords: breast cancer; diagnosis; lightGBM; multiple omics data; prognosis; transcriptional risk scores.

MeSH terms

  • Algorithms*
  • Genome-Wide Association Study
  • Neoplasms*
  • Risk Factors
  • Survival Analysis