LangMoDHS: A deep learning language model for predicting DNase I hypersensitive sites in mouse genome

Xingyu Tang; Peijie Zheng; Yuewu Liu; Yuhua Yao; Guohua Huang

doi:10.3934/mbe.2023048

Math Biosci Eng. 2023 Jan;20(1):1037-1057. doi: 10.3934/mbe.2023048. Epub 2022 Oct 24.

Authors

Xingyu Tang¹, Peijie Zheng¹, Yuewu Liu², Yuhua Yao³, Guohua Huang¹

Affiliations

¹ School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China.
² College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China.
³ School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China.

PMID: 36650801
DOI: 10.3934/mbe.2023048

Abstract

DNase I hypersensitive sites (DHSs) are specific genomic regions that are critical for detecting or understanding cis-regulatory elements. Although many methods have been developed to detect DHSs, a substantial gap remains in practice. We presented a deep learning-based language model for predicting DHSs, named LangMoDHS. LangMoDHS mainly comprised a convolutional neural network (CNN), a bi-directional long short-term memory (Bi-LSTM) network and feed-forward attention. The CNN and the Bi-LSTM were stacked in parallel, which helped to accumulate multiple-view representations from primary DNA sequences. We conducted 5-fold cross-validation and independent tests over 14 tissues and 4 developmental stages. The experiments showed that LangMoDHS is competitive with or slightly better than iDHS-Deep, the latest method for predicting DHSs. They also implied a substantial contribution of the CNN, Bi-LSTM and attention to DHS prediction. We implemented LangMoDHS as a user-friendly web server, accessible at http://www.biolscience.cn/LangMoDHS/. We used indices related to information entropy to explore the sequence motifs of DHSs, and the analysis provided some insight into DHSs.
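
The abstract does not say which entropy-related indices were used for the motif analysis. As a minimal illustration only, the sketch below assumes per-position Shannon entropy over a set of equal-length DNA sequences; the function name `positional_entropy` and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def positional_entropy(sequences):
    """Shannon entropy (in bits) at each position of equal-length DNA sequences.

    Assumption: this is one plausible entropy-based index, not necessarily
    the one used by the LangMoDHS authors.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    entropies = []
    # zip(*sequences) yields one tuple of bases per alignment column.
    for column in zip(*sequences):
        counts = Counter(base.upper() for base in column)
        total = sum(counts.values())
        probs = [count / total for count in counts.values()]
        entropies.append(sum(-p * math.log2(p) for p in probs))
    return entropies


# Hypothetical toy input: four aligned 4-nt sequences.
toy_sequences = ["ACGT", "ACGA", "ACTC", "AGGG"]
toy_entropy = positional_entropy(toy_sequences)
```

Under this assumption, a position where every sequence carries the same base has entropy 0 bits, and a position where the four bases are equally frequent has the maximum of 2 bits, so low-entropy positions point to conserved sites within a candidate motif.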

Keywords: Bi-LSTM; CNN; DNase I hypersensitive site; deep learning; genome.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Animals
Deep Learning*
Deoxyribonuclease I / genetics
Deoxyribonuclease I / metabolism
Genomics
Mice
Regulatory Sequences, Nucleic Acid

Substances

Deoxyribonuclease I