[Identification of potential regulatory genes for embryonic stem cell self-renewal and pluripotency by random forest]

Nan Fang Yi Ke Da Xue Xue Bao. 2021 Aug 20;41(8):1234-1238. doi: 10.12122/j.issn.1673-4254.2021.08.16.
[Article in Chinese]

Abstract

Objective: To identify novel genes associated with self-renewal and pluripotency of mouse embryonic stem cells (mESCs) by integrating multiomics data using machine learning methods.

Methods: We integrated multiomics data from mESCs, covering the transcriptome, histone modifications, chromatin accessibility, transcription factor binding and architectural protein binding, and compared the signal differences between known stem cell self-renewal and pluripotency genes and other genes. Using these integrated data, we built prediction models with several machine learning classifiers, including random forest, and performed 5-fold cross-validation. The model was trained on a training dataset containing two thirds of the input samples, and the remaining one third was held out as a test dataset to assess model performance in independent tests. Finally, the model's predictions were validated through gene function annotation and cell function experiments, including a cell viability assay, a colony formation assay and cell cycle analysis.
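The evaluation scheme described in the Methods (a two-thirds/one-third split, 5-fold cross-validation and AUC scoring of a random forest) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn, uses a synthetic feature matrix in place of the real multiomics features, and picks the hyperparameters (`n_estimators`, `random_state`) arbitrarily. The abstract does not say whether cross-validation was run on the full data or only on the training portion; the sketch assumes the latter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in for a gene-by-feature matrix (assumption: the real
# features would be per-gene multiomics signals; y = 1 marks known
# self-renewal/pluripotency genes, y = 0 marks other genes).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

# Hold out one third of the samples as an independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

# Hyperparameters are illustrative, not taken from the paper.
model = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation on the training portion, scored by ROC AUC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

# Refit on the whole training set and score the held-out test set.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"Cross-validation AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
print(f"Independent test AUC: {test_auc:.3f}")
```

Reporting the mean and standard deviation over folds mirrors the "mean±SD" form of the AUC values in the Results. A spread on the independent-test AUC, as reported there, would require repeating the split with different random seeds, which this sketch does not do.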

Results: Compared with random genes, genes known to be associated with self-renewal and pluripotency of mESCs showed significantly different features in the multiomics data. Random forest outperformed the other machine learning algorithms tested on these data, with an area under the curve (AUC) of 0.883±0.018 in cross-validation and 0.880±0.028 in the independent test. Based on this model, we identified 893 potential regulatory genes associated with self-renewal and pluripotency of mESCs, whose functional annotations were similar to those of the known genes. Knockdown of the predicted novel regulator gene Cct6a significantly decreased the viability of mESCs (P < 0.0001) and the number of colonies formed (P < 0.01), significantly increased the number of cells in G1 phase (P < 0.01) and decreased the number of cells in S phase (P < 0.05). Cct6a-knockdown mESCs also failed to stain positive for alkaline phosphatase.

Conclusion: A machine learning model based on multiomics data can predict potential self-renewal and pluripotency regulators with high performance. Using this model, we predicted potential self-renewal and pluripotency regulatory genes, including Cct6a, and validated them experimentally. This model provides new insights into the regulatory mechanisms of mESCs and contributes to stem cell research.

Keywords: machine learning; mouse embryonic stem cells; pluripotency; random forest; self-renewal.

MeSH terms

  • Animals
  • Cell Differentiation
  • Cell Self Renewal*
  • Cell Survival
  • Genes, Regulator
  • Mice
  • Mouse Embryonic Stem Cells*

Grants and funding

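National Key Research and Development Program of China (2017YFA0102800, 2016YFA0101700); National Natural Science Foundation of China (31970811, 31771639, 32170798, 31671420, 81602482); Frontier Research Program of the Guangzhou Regenerative Medicine and Health Guangdong Laboratory (2018GZR110105007); Guangdong Province Program for Introducing Innovative and Entrepreneurial Teams (2016ZT06S029)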