A deep learning framework for enhancer prediction using word embedding and sequence generation

Biophys Chem. 2022 Jul:286:106822. doi: 10.1016/j.bpc.2022.106822. Epub 2022 May 5.

Abstract

Enhancers are non-coding DNA fragments that play key roles in gene regulation and can promote the transcription of structural genes, thereby affecting the expression of structural proteins, catalytic enzymes, and regulatory proteins. Accurate identification of enhancers aids understanding of structural gene transcription and of human tumorigenesis, and can inform tumor diagnosis and treatment. Because enhancer sequences vary widely in position and are highly dispersed, identifying them is more challenging than identifying other genetic factors. A deep learning framework for enhancer identification is proposed, based on word embedding and sequence generative adversarial networks. First, because the benchmark dataset contains few sequences, RankGAN is used to enlarge the dataset while preserving the characteristics of the data. Then, in view of the similarity between DNA sequences and natural language, each DNA sequence is treated as a sentence composed of multiple "words", and the word embedding technique FastText is applied to transform it into a numerical matrix. To extract the dependencies among nucleotides and highly abstract features of DNA sequences, a Long Short-Term Memory Convolutional Neural Network (LSTM-CNN) is constructed to perform the identification task. On the independent test set, the accuracy and Matthews correlation coefficient (MCC) for enhancer prediction are 0.7525 and 0.5051, respectively. For enhancer type prediction, the accuracy and MCC are 0.6972 and 0.3954, respectively. Compared with existing methods, this method achieves more satisfactory results for the prediction of enhancers and their types. This study further extends the application of natural language processing to bioinformatics.
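
The two ideas that carry most of the method, treating a DNA sequence as a sentence of "words" and scoring predictions with MCC, can be illustrated with a minimal Python sketch. The abstract does not say how the words are formed, so the overlapping k-mer split below (with k = 3 and a stride of 1) is an assumption made for illustration, as are the function names `kmer_words` and `mcc` and the confusion-matrix counts in the usage lines. The FastText embedding and the LSTM-CNN classifier are not reproduced here.

```python
from math import sqrt


def kmer_words(sequence, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer "words".

    k and stride are illustrative assumptions; the abstract does not
    state how the authors tokenize sequences.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# A short sequence becomes a "sentence" that an embedding model could consume.
sentence = kmer_words("ACGTAC", k=3)
print(sentence)  # ['ACG', 'CGT', 'GTA', 'TAC']

# Hypothetical counts, not results from the paper.
print(round(mcc(tp=8, tn=7, fp=3, fn=2), 4))
```

Each list of k-mers would then be mapped to a matrix by looking up one embedding vector per word. MCC is reported alongside accuracy because it uses all four cells of the confusion matrix.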

Keywords: Enhancer; FastText; Natural language; RankGAN; Word embedding.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computational Biology / methods
  • Deep Learning*
  • Humans
  • Neural Networks, Computer