Integrating clinical and cross-cohort metagenomic features: a stable and non-invasive colorectal cancer and adenoma diagnostic model

Front Mol Biosci. 2024 Jan 22:10:1298679. doi: 10.3389/fmolb.2023.1298679. eCollection 2023.

Abstract

Background: Dysbiosis is associated with colorectal cancer (CRC) and adenomas (CRA). However, the robustness of diagnostic models based on microbial signatures in multiple cohorts remains unsatisfactory. Materials and Methods: In this study, we used machine learning models to screen metagenomic signatures from the respective cross-cohort datasets of CRC and CRA (selected from CuratedMetagenomicData, each disease included 4 datasets). Then select a CRC and CRA data set from the CuratedMetagenomicData database and meet the requirements of having both metagenomic data and clinical data. This data set will be used to verify the inference that integrating clinical features can improve the performance of microbial disease prediction models. Results: After repeated verification, we selected 20 metagenomic features that performed well and were stably expressed within cross-cohorts to represent the diagnostic role of bacterial communities in CRC/CRA. The performance of the selected cross-cohort metagenomic features was stable for multi-regional and multi-ethnic populations (CRC, AUC: 0.817-0.867; CRA, AUC: 0.766-0.833). After clinical feature combination, AUC of our integrated CRC diagnostic model reached 0.939 (95% CI: 0.932-0.947, NRI=30%), and that of the CRA integrated model reached 0.925 (95%CI: 0.917-0.935, NRI=18%). Conclusion: In conclusion, the integrated model performed significantly better than single microbiome or clinical feature models in all cohorts. Integrating cross-cohort common discriminative microbial features with clinical features could help construct stable diagnostic models for early non-invasive screening for CRC and CRA.

Keywords: colorectal adenoma; colorectal cancer; gut microbiome; machine learning; metagenomics.

Grants and funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was supported by grants from the National Natural Science Foundation of China (81974062 and 81900477), International (Regional) Cooperation and Exchange Projects (81720108006) and Hubei Provincial Science and Technology Key Research and Development Project (General Project) (2020BHB017).