Construction and evaluation of a metabolic correlation diagnostic model for diabetes based on machine learning algorithms

Environ Toxicol. 2024 Apr 29. doi: 10.1002/tox.24213. Online ahead of print.

Abstract

Background: Diabetes mellitus (DM) is a prevalent chronic disease marked by significant metabolic dysfunctions. Understanding its molecular mechanisms is vital for early diagnosis and treatment strategies.

Methods: We used the GEO datasets GSE7014, GSE25724, and GSE156248 to build a diagnostic model for DM with Random Forest (RF) and LASSO regression; GSE20966 served as the validation cohort. DM patients were classified into two subtypes for functional enrichment analysis. Expression levels of the key diagnostic genes, focusing on CXCL12 and PPP1R12B, were validated by quantitative real-time PCR (qRT-PCR) in peripheral blood mononuclear cells (PBMCs) from DM patients and healthy controls, with GAPDH as the internal control.
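The gene-selection step described in the Methods can be illustrated with a minimal sketch. It assumes one common convention, which the abstract does not confirm: keep the genes that both rank among the top random-forest importances and retain a nonzero LASSO coefficient. The function name `select_candidate_genes`, the `top_k` and `alpha` values, the use of scikit-learn, and the synthetic expression matrix are all illustrative assumptions, not the authors' pipeline or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso


def select_candidate_genes(X, y, gene_names, top_k=5, alpha=0.05, seed=0):
    """Return genes in the RF top_k that also keep a nonzero LASSO weight.

    Hypothetical helper: the intersection rule, top_k and alpha are assumptions.
    """
    # Random forest: rank genes by impurity-based feature importance.
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X, y)
    top_idx = np.argsort(rf.feature_importances_)[::-1][:top_k]
    rf_genes = {gene_names[i] for i in top_idx}

    # LASSO: the L1 penalty shrinks uninformative coefficients to exactly zero.
    # Features are standardized so the penalty treats every gene equally.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_std, y)
    lasso_genes = {g for g, w in zip(gene_names, lasso.coef_) if w != 0.0}

    return sorted(rf_genes & lasso_genes)


# Synthetic stand-in for an expression matrix (samples x genes); not GEO data.
rng = np.random.default_rng(0)
n_per_group, n_genes = 30, 20
gene_names = [f"GENE_{i}" for i in range(n_genes)]
y = np.array([0] * n_per_group + [1] * n_per_group)  # 0 = control, 1 = DM
X = rng.normal(size=(2 * n_per_group, n_genes))
X[y == 1, 0] += 3.0  # GENE_0 is made strongly up-regulated in the DM group
X[y == 1, 1] += 3.0  # GENE_1 likewise

selected = select_candidate_genes(X, y, gene_names)
print(selected)
```

For a binary outcome, an L1-penalized logistic model is a frequent alternative to the linear LASSO used here; either way, the selected genes would then be refit on the training cohorts and scored on an independent cohort such as GSE20966.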

Results: After removing batch effects across the datasets, we identified 131 differentially expressed genes (DEGs) between the DM and control groups, of which 70 were up-regulated and 61 down-regulated. Enrichment analysis revealed significant down-regulation of the IL-12 signaling pathway, JAK signaling after IL-12 stimulation, and the ferroptosis pathway in DM. Five genes (CXCL12, MXRA5, UCHL1, PPP1R12B, and C7) were identified as having diagnostic value. The diagnostic model showed high accuracy in both the training and validation cohorts. The gene set also enabled the subclassification of DM patients into groups with distinct functional traits. qRT-PCR results confirmed the bioinformatics findings, particularly the up-regulation of CXCL12 and PPP1R12B in DM patients.
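How a qRT-PCR readout with an internal control translates into "up-regulation" can be sketched as follows. The abstract does not state the quantification method, so this example assumes the widely used 2^-ΔΔCt calculation; the function name `relative_expression` and all Ct values are made up for illustration and are not data from the study.

```python
from statistics import mean


def relative_expression(ct_target_case, ct_ref_case, ct_target_ctrl, ct_ref_ctrl):
    """Fold change of a target gene (case vs control) by 2^-ddCt.

    Hypothetical helper; assumes the 2^-ddCt method with one reference gene.
    """
    # dCt: target Ct normalized to the reference gene (e.g. GAPDH) in each group.
    dct_case = mean(ct_target_case) - mean(ct_ref_case)
    dct_ctrl = mean(ct_target_ctrl) - mean(ct_ref_ctrl)
    # ddCt: normalized case value relative to the normalized control value.
    ddct = dct_case - dct_ctrl
    # A lower Ct means more transcript, so a negative ddCt gives fold change > 1.
    return 2 ** (-ddct)


# Made-up Ct values for one target gene and the reference gene.
fold = relative_expression(
    ct_target_case=[24.0, 24.5, 23.5],
    ct_ref_case=[18.0, 18.5, 17.5],
    ct_target_ctrl=[25.0, 25.5, 24.5],
    ct_ref_ctrl=[18.0, 18.0, 18.0],
)
print(fold)  # 2.0: the target is about twofold higher in cases
```

A fold change above 1 indicates higher expression in the case group than in controls after normalization to the reference gene.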

Conclusion: Our study pinpointed seven energy metabolism-related genes that are differentially expressed between DM patients and controls, five of which have diagnostic value. Our model accurately diagnosed DM and facilitated patient subclassification, offering new insights into DM pathogenesis.

Keywords: diabetes mellitus; differential gene expression; disease subtyping; gene expression omnibus (GEO); machine learning models (random forest, LASSO).