promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences

IEEE/ACM Trans Comput Biol Bioinform. 2024 Jan-Feb;21(1):208-214. doi: 10.1109/TCBB.2023.3339597. Epub 2024 Feb 5.

Abstract

Accurate identification of DNA promoter sequences is crucial for unraveling the mechanisms that regulate gene transcription. Initiation of transcription is controlled by regulatory transcription factors binding to core promoter regions in the DNA sequence. Detection of promoter regions is necessary for building genetic regulatory networks for biomedical and clinical applications, and for identifying rarely expressed genes. We propose a novel ensemble learning technique that uses deep recurrent neural networks with convolutional feature extraction and hard negative pattern mining to detect several types of promoter sequences, both with and without the TATA-box, within DNA sequences of four different species. Using extensive independent tests and previously published results, we demonstrate that our method sets a new state of the art, with a Matthews correlation coefficient above 98% in all eight organism categories, for recognizing the stretch of base pairs that forms the promoter region within DNA sequences.
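For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Matthews correlation coefficient can be computed from the four confusion-matrix counts of a binary promoter/non-promoter classifier. The function name `mcc` and the counts in the usage lines are illustrative assumptions, not code or results from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn/fp/fn are true positives, true negatives, false positives and
    false negatives. The result lies in [-1, 1]; papers often report it
    as a percentage (0.98 -> 98%).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero
    # (the coefficient is otherwise undefined).
    return numerator / denominator if denominator else 0.0


# Hypothetical counts (NOT from the paper): a balanced test set of
# 100 promoters and 100 non-promoters with one error in each class.
example_score = mcc(tp=99, tn=99, fp=1, fn=1)
print(f"MCC = {example_score:.2f}")
```

Unlike plain accuracy, the coefficient uses all four cells of the confusion matrix, so it stays informative when promoter and non-promoter examples are imbalanced.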

MeSH terms

  • Base Sequence
  • DNA* / genetics
  • Machine Learning*
  • Promoter Regions, Genetic / genetics
  • TATA Box
  • Transcription, Genetic

Substances

  • DNA