KISL: knowledge-injected semi-supervised learning for biological co-expression network modules

Gangyi Xiao; Renchu Guan; Yangkun Cao; Zhenyu Huang; Ying Xu

doi:10.3389/fgene.2023.1151962

Front Genet. 2023 May 2:14:1151962. doi: 10.3389/fgene.2023.1151962. eCollection 2023.

Authors

Gangyi Xiao¹, Renchu Guan¹, Yangkun Cao², Zhenyu Huang¹, Ying Xu³

Affiliations

¹ College of Computer Science and Technology, Jilin University, Changchun, China.
² School of Artificial Intelligence, Jilin University, Changchun, China.
³ School of Medicine, Southern University of Science and Technology, Shenzhen, Guangdong, China.

Abstract

The exploration of important biomarkers associated with cancer development is crucial for diagnosing cancer, designing therapeutic interventions, and predicting prognoses. The analysis of gene co-expression provides a systemic perspective on gene networks and can be a valuable tool for mining biomarkers. The main objective of co-expression network analysis is to discover highly synergistic sets of genes, and the most widely used method is weighted gene co-expression network analysis (WGCNA). WGCNA measures gene correlation with the Pearson correlation coefficient and uses hierarchical clustering to identify gene modules. However, the Pearson correlation coefficient reflects only the linear dependence between variables, and the main drawback of hierarchical clustering is that once two objects are clustered together, the process cannot be reversed. Hence, inappropriate cluster divisions cannot be readjusted. In addition, existing co-expression network analysis methods rely on unsupervised methods that do not utilize prior biological knowledge for module delineation. Here we present a knowledge-injected semi-supervised learning approach (KISL) for identifying outstanding modules in a co-expression network. KISL combines a priori biological knowledge with a semi-supervised clustering method to address these issues in current GCN-based clustering methods. Because gene-gene relationships are complex, we use distance correlation to measure both the linear and non-linear dependence between genes. Eight RNA-seq datasets of cancer samples are used to validate the method's effectiveness. In all eight datasets, the KISL algorithm outperformed WGCNA on the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index evaluation metrics. According to the results, KISL clusters had better cluster evaluation values and better gene module aggregation. Enrichment analysis of the identified modules demonstrated the method's effectiveness in discovering modular structures in biological co-expression networks. In addition, as a general method, KISL can be applied to various co-expression network analyses based on similarity metrics. Source code for KISL and the related scripts is available online at https://github.com/Mowonhoo/KISL.git.
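
The contrast drawn in the abstract between the Pearson coefficient and distance correlation can be illustrated with a small sketch. The code below is not taken from the KISL repository; it is a minimal pure-Python illustration of the standard sample distance correlation for two one-dimensional profiles, and the function names (`distance_correlation`, `pearson`) and the toy "expression" values are assumptions made for this example only.

```python
import math


def _centered_distances(values):
    """Pairwise |a - b| distance matrix, double-centered by row, column and grand means."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length numeric sequences (range 0 to 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # at least one profile is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


def pearson(x, y):
    """Pearson correlation coefficient, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


# Hypothetical toy profiles: gene_b depends on gene_a quadratically, gene_c linearly.
gene_a = [-2.0, -1.0, 0.0, 1.0, 2.0]
gene_b = [v * v for v in gene_a]
gene_c = [2.0 * v + 1.0 for v in gene_a]

print("Pearson(a, b):", pearson(gene_a, gene_b))
print("dCor(a, b):   ", distance_correlation(gene_a, gene_b))
print("dCor(a, c):   ", distance_correlation(gene_a, gene_c))
```

In this toy case the Pearson coefficient of the quadratic pair is zero although one profile fully determines the other, while the distance correlation is clearly positive; for the linear pair the distance correlation equals one. A pairwise matrix of such values could then serve as the similarity measure for module detection, although how KISL builds and uses that matrix is described in the full paper rather than here.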

Keywords: biological co-expression network; factor analysis; feature selection; network modules identification; semi-supervised learning algorithm.

Grants and funding

Our work is supported by the National Key Research and Development Program of China (No. 2021YFF1201200), the National Natural Science Foundation of China (No. 62172187 and No. 61972174), the Liaoning Provincial Archives Science and Technology Project (Grant No. 2021-X-012 and Grant No. 2022-X-017), the Guangdong Universities’ Innovation Team Project (No. 2021KCXTD015), and the Guangdong Key Disciplines Project (No. 2021ZDJS138).