[Unsupervised deep learning for identifying O6-carboxymethyl guanine by nanopore sequencing]

Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2022 Feb 25;39(1):139-148. doi: 10.7507/1001-5515.202104068.
[Article in Chinese]

Abstract

O6-carboxymethyl guanine (O6-CMG) is a highly mutagenic alkylation product of DNA that causes gastrointestinal cancer in organisms. Existing studies have used a mutant Mycobacterium smegmatis porin A (MspA) nanopore, assisted by Phi29 DNA polymerase, to localize it. Machine learning has recently been widely applied to the analysis of nanopore sequencing data, but it usually requires a large number of data labels, which adds to researchers' workload and greatly limits its practicality. Accordingly, this paper proposes a nano-Unsupervised-Deep-Learning method (nano-UDL), based on an unsupervised clustering algorithm, to automatically identify nanopore events that contain the lesion. Specifically, nano-UDL first uses a deep autoencoder to extract features from the nanopore dataset and then applies the MeanShift clustering algorithm to classify the data. In addition, nano-UDL extracts the optimal features for clustering by jointly optimizing the clustering loss and the reconstruction loss. Experimental results demonstrate that nano-UDL achieves relatively high recognition accuracy on the O6-CMG dataset and accurately identifies all sequence segments containing O6-CMG. To further verify the robustness of nano-UDL, hyperparameter sensitivity tests and ablation experiments were carried out. Using machine learning to analyze nanopore data can effectively reduce the additional cost of manual data analysis, which is significant for many biological studies, including genome sequencing.

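O6-carboxymethyl guanine (O6-CMG) is a highly mutagenic alkylation product in DNA that causes gastrointestinal tumors in living organisms. Existing studies mainly use the Mycobacterium smegmatis porin A (MspA) nanopore, assisted by the Bacillus subtilis phage Phi29 DNA polymerase, to precisely localize the mutation. In recent years, machine learning has been widely applied to the analysis of nanopore sequencing data, but it often requires a large amount of labeled data, which places an extra workload on researchers and greatly limits its practicality. This paper therefore proposes a nano unsupervised deep learning (nano-UDL) method that automatically identifies nanopore data containing mutated segments. nano-UDL uses a deep autoencoder to extract features from nanopore data and then classifies the feature data with the MeanShift clustering algorithm. In addition, the method jointly optimizes the clustering loss and the reconstruction loss so as to extract the optimal features for clustering. Experimental results show that nano-UDL achieves high recognition accuracy on the O6-CMG dataset and accurately identifies all sequence segments containing O6-CMG. To further verify the robustness of nano-UDL, hyperparameter sensitivity tests and ablation experiments were carried out. Analyzing nanopore data with nano-UDL not only effectively reduces the extra cost of manual data analysis but is also of significance for many biological studies, including genome sequencing.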
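
The pipeline described in the abstract (autoencoder features followed by MeanShift clustering) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the synthetic "current traces", the one-hidden-layer autoencoder, the Gaussian-kernel mean shift, the nearest-neighbour bandwidth heuristic, and all function names and parameter values are assumptions made for illustration. In particular, the sketch trains on reconstruction loss only and clusters afterwards, whereas the abstract states that nano-UDL jointly optimizes a clustering loss and the reconstruction loss.

```python
import numpy as np

def train_autoencoder(X, latent_dim=2, epochs=500, lr=0.05, seed=0):
    """Hypothetical one-hidden-layer autoencoder:
    z = tanh(X W1 + b1), X_hat = z W2 + b2, trained on reconstruction MSE only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, latent_dim)); b1 = np.zeros(latent_dim)
    W2 = rng.normal(0, 0.1, (latent_dim, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        Z = np.tanh(X @ W1 + b1)
        err = Z @ W2 + b2 - X               # reconstruction residual
        gZ = (err @ W2.T) * (1 - Z ** 2)    # backpropagate through tanh
        W2 -= lr * Z.T @ err / n; b2 -= lr * err.mean(0)
        W1 -= lr * X.T @ gZ / n;  b1 -= lr * gZ.mean(0)
    return lambda A: np.tanh(A @ W1 + b1)   # encoder only

def estimate_bandwidth(Z, quantile=0.3):
    """Heuristic (assumption): mean distance to the k-th nearest neighbour."""
    dist = np.linalg.norm(Z[:, None] - Z[None], axis=-1)
    k = max(1, int(len(Z) * quantile))
    return np.sort(dist, axis=1)[:, k].mean()

def mean_shift(Z, bandwidth, iters=300, tol=1e-7):
    """Gaussian-kernel mean shift: move every point uphill to a density mode,
    then merge modes that lie closer together than the bandwidth."""
    modes = Z.copy()
    for _ in range(iters):
        d2 = ((modes[:, None] - Z[None]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        new = (w @ Z) / w.sum(1, keepdims=True)
        converged = np.abs(new - modes).max() < tol
        modes = new
        if converged:
            break
    centers, labels = [], np.empty(len(Z), dtype=int)
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

# Synthetic stand-in for nanopore events (not real data): 30 unmodified
# traces and 30 traces with a hypothetical current shift at positions 8-12.
rng = np.random.default_rng(0)
base = np.zeros(20)
lesion = base.copy(); lesion[8:13] = 1.0
X = np.vstack([base + rng.normal(0, 0.05, (30, 20)),
               lesion + rng.normal(0, 0.05, (30, 20))])

encode = train_autoencoder(X)
Z = encode(X)                                # low-dimensional features
labels, centers = mean_shift(Z, estimate_bandwidth(Z))
print("clusters found:", len(centers))
```

No labels are used at any point: the cluster assignment comes entirely from the density structure of the learned features, which is the property the abstract emphasizes. Mean shift also does not need the number of clusters in advance; only the bandwidth has to be chosen or estimated.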

Keywords: Carboxymethyl guanine; DNA lesion; Deep Learning; Gastrointestinal cancer; Nanopore sequencing; Unsupervised learning.

MeSH terms

  • Deep Learning*
  • Guanine
  • Nanopore Sequencing*
  • Nanopores*
  • Porins / genetics

Substances

  • Porins
  • Guanine

Grants and funding

National Natural Science Foundation of China (61861130366, 61876082, 61732006, 62136004); National Key Research and Development Program of China (2018YFC2001600, 2018YFC2001602)