BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT

PeerJ. 2023 Dec 8:11:e16600. doi: 10.7717/peerj.16600. eCollection 2023.

Abstract

DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes and plays important roles in various developmental and physiological processes as well as in a wide range of human diseases. Accurate detection of 5mC sites is therefore essential. Although current sequencing technologies can map 5mC sites genome-wide, these experimental methods are both costly and time-consuming. To achieve fast and accurate prediction of 5mC sites, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (bidirectional encoder representations from transformers) model, using human promoter sequences as the language corpus. BERT is a deep bidirectional language representation model based on the Transformer architecture. Second, we fine-tuned the domain-specific BERT model on the 5mC training dataset to build the predictor. Cross-validation results show that our model achieves an AUROC of 0.966, which is higher than that of other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. On the independent test set, our model likewise achieves an AUROC of 0.966, again higher than that of the other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate the use of our model, we built a webserver, which can be freely accessed at: http://5mc-pred.zhulab.org.cn.
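
The AUROC figures quoted in the abstract can be read as the probability that a randomly chosen 5mC site receives a higher predicted score than a randomly chosen non-5mC site. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `auroc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive score and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical, not from the paper): 1 = 5mC site, 0 = non-5mC site.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.92, 0.80, 0.75, 0.40, 0.30, 0.10]
toy_auroc = auroc(toy_labels, toy_scores)
```

In this toy example, 8 of the 9 positive/negative pairs are ordered correctly, so `toy_auroc` is 8/9. The quadratic pairwise loop is used for clarity; for datasets of realistic size, a rank-based computation would be the more practical choice.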

Keywords: BERT; DNA 5-methylcytosine; Fine-tuning; Machine learning; Natural language processing; Webserver.

MeSH terms

  • 5-Methylcytosine*
  • DNA* / genetics
  • Electric Power Supplies
  • Eukaryota
  • Humans
  • Language

Substances

  • 5-Methylcytosine
  • DNA

Grants and funding

This work was supported by the National Natural Science Foundation of China (No. 21403002), the Young Wanjiang Scholar Program of Anhui Province, and the Research Program of Education Department of Anhui Province (YJS20210223, 2023AH050998). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.