MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning

Comput Struct Biotechnol J. 2022 Jul 14:20:3783-3795. doi: 10.1016/j.csbj.2022.07.022. eCollection 2022.

Abstract

In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into the molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), experiments frequently rely on small sample sizes. Methods that screen reliable and stable features are therefore urgently needed for analyses with limited sample sizes. In this study, MSPJ, a new machine learning approach for identifying DEGs, was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method combines three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and a permutation test. MSPJ was compared with ten classical methods on 94 simulated datasets and in large-scale benchmarking on 165 real datasets. The results showed that, among these methods, MSPJ had the best performance in most small gene expression datasets, especially those with sample sizes below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes.
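
To make the permutation-test component more concrete, the following is a minimal sketch of a two-group permutation test on a standardized mean difference for a single gene. It is not the authors' implementation: the function names, the choice of a pooled-standard-deviation effect size, the number of permutations, and the toy expression values are all assumptions made for illustration, and the abstract does not specify how MSPJ combines this step with random sampling, meta-analysis, or SVM-RFE.

```python
import random
import statistics


def standardized_mean_difference(a, b):
    """Effect size: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the absolute standardized mean difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(standardized_mean_difference(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle group labels by shuffling the pooled values and re-splitting.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        if abs(standardized_mean_difference(perm_a, perm_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy expression values for one hypothetical gene (not data from the paper).
case = [8.1, 7.9, 8.4, 8.6, 8.2, 8.0]
control = [6.9, 7.1, 7.3, 6.8, 7.0, 7.2]

p_shifted = permutation_p_value(case, control)   # groups clearly separated
p_null = permutation_p_value(control, control)   # identical groups, no effect
```

With only six samples per group there are few distinct label assignments, so the smallest attainable p-value is bounded away from zero; this is one way to see why small datasets reduce the power of DEG detection.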

Keywords: AUC, area under the ROC curve; DEGs, differentially expressed genes; Differentially expressed genes; FDR, false positive rate; Feature selection; GA, genetic algorithm; GEO, Gene Expression Omnibus; GO, gene ontology; MSPJ, the Joint method of Meta-analysis, SVM-RFE, and Permutation test; Machine learning; RF, random forest; ROC, receiver operating characteristic; Random sampling; SAM, significance analysis of microarrays; SMDs, standardized mean differences; SNR, signal-to-noise ratio; SVM-RFE, support vector machines-recursive feature elimination; Small sample size; mRMR, minimum-redundancy-maximum-relevance.