Employing bimodal representations to predict DNA bendability within a self-supervised pre-trained framework

Nucleic Acids Res. 2024 Apr 12;52(6):e33. doi: 10.1093/nar/gkae099.

Abstract

The bendability of genomic DNA, which measures the DNA looping rate, is crucial for numerous biological processes. Recently, a high-throughput technique known as 'loop-seq' has made it possible to measure the inherent cyclizability of DNA fragments. However, quantifying the bendability of DNA at large scale remains costly, laborious, and time-consuming. To close the gap between rapidly evolving large language models and expanding genomic sequence information, and to elucidate the impact of DNA bendability on critical regulatory sequence motifs such as super-enhancers in the human genome, we introduce a computational model, MIXBend, that predicts DNA bendability from both nucleotide sequences and physicochemical properties. In MIXBend, the pre-trained language model DNABERT and a convolutional neural network with an attention mechanism are used to construct sequence-based and physicochemical-based extractors that refine DNA sequence representations. These bimodal DNA representations are then fed to a k-mer sequence-physicochemistry matching module to minimize the semantic gap between the two modalities. Lastly, a self-attention fusion layer is employed to predict DNA bendability. Experimental results show that MIXBend outperforms other state-of-the-art methods. Additionally, MIXBend reveals both novel and known motifs in yeast. Moreover, MIXBend discovers significant bendability fluctuations within super-enhancer regions and transcription factor binding sites in the human genome.
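
The k-mer unit on which the sequence branch and the matching module operate can be illustrated with a minimal sketch. It assumes overlapping k-mers with a stride of one, the tokenization DNABERT is known to use. The function name `kmer_tokens`, the example sequence, and the choice of k = 3 are illustrative assumptions, not details taken from MIXBend.

```python
def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration only; the value of k used
    by MIXBend is not stated in the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    # A sequence of length n yields n - k + 1 overlapping k-mers
    # (an empty list when the sequence is shorter than k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    print(kmer_tokens("ATGCGT", k=3))  # ['ATG', 'TGC', 'GCG', 'CGT']
```

In a bimodal setup of the kind described, each such k-mer would be paired with a physicochemical descriptor of the same window, so that the two representations can be aligned position by position.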

MeSH terms

  • Base Sequence
  • Chemical Phenomena
  • Computational Biology* / methods
  • DNA* / chemistry
  • DNA* / genetics
  • Genome, Human
  • Genomics
  • Humans
  • Neural Networks, Computer
  • Protein Binding
  • Saccharomyces cerevisiae / genetics

Substances

  • DNA