An Integrative Framework for Combining Sequence and Epigenomic Data to Predict Transcription Factor Binding Sites Using Deep Learning

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):355-364. doi: 10.1109/TCBB.2019.2901789. Epub 2021 Feb 3.

Abstract

Knowing the transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and the downstream cellular functions. Convolutional neural networks (CNNs) have outperformed conventional methods in predicting TFBSs from the primary DNA sequence. Besides DNA sequence, histone modifications and chromatin accessibility are important factors influencing transcription factor binding, and they have recently been explored for predicting TFBSs. However, current methods rarely use a CNN to take histone modifications and chromatin accessibility into account within an integrative framework. To this end, we developed a general CNN model that integrates these data for predicting TFBSs. We systematically benchmarked a series of architecture variants by changing the network structure in terms of width and depth, and explored the effect of the length of the flanking regions included in each sample. We evaluated the performance of the three types of data and their combinations using 256 ChIP-seq experiments, and compared the results with competing machine learning methods. We found that the contributions from these three types of data are complementary to each other. Moreover, the integrative CNN framework significantly outperforms traditional machine learning methods.
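As a rough illustration of what integrating the three data types in a CNN input could look like, the sketch below one-hot encodes a DNA sequence, stacks per-base histone-modification and chromatin-accessibility signals as extra input channels, and applies a single convolutional filter with ReLU. This is a minimal sketch under stated assumptions, not the architecture from the paper: the abstract does not specify how the inputs are combined, and every function name, the toy signal values, and the filter weights are hypothetical.

```python
# Hypothetical sketch: names, shapes, and values are illustrative only
# and are not taken from the paper.
BASES = "ACGT"


def one_hot(seq):
    """Return a 4 x L matrix (list of rows) one-hot encoding a DNA string.

    Bases outside ACGT (for example N) become an all-zero column.
    """
    seq = seq.upper()
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def stack_channels(seq, histone, accessibility):
    """Stack sequence, histone-mark, and accessibility tracks.

    The result has 4 + len(histone) + 1 rows, each of length len(seq).
    """
    tracks = list(histone) + [accessibility]
    for track in tracks:
        if len(track) != len(seq):
            raise ValueError("every signal track must match the sequence length")
    return one_hot(seq) + [[float(v) for v in track] for track in tracks]


def conv1d_relu(x, kernel, bias=0.0):
    """Apply one filter across all channels (valid padding), then ReLU."""
    width = len(kernel[0])
    length = len(x[0])
    out = []
    for start in range(length - width + 1):
        total = bias
        for c in range(len(x)):
            for k in range(width):
                total += x[c][start + k] * kernel[c][k]
        out.append(max(0.0, total))
    return out


seq = "ACGTAC"
histone = [[0, 1, 2, 2, 1, 0]]      # one made-up histone-mark track
accessibility = [0, 0, 1, 1, 1, 0]  # made-up accessibility signal
x = stack_channels(seq, histone, accessibility)  # 6 channels x 6 positions

# A hand-set width-2 filter that responds to "CG" overlapping accessible
# chromatin. Rows follow the channel order: A, C, G, T, histone, accessibility.
kernel = [
    [0.0, 0.0],  # A
    [1.0, 0.0],  # C at the first position
    [0.0, 1.0],  # G at the second position
    [0.0, 0.0],  # T
    [0.0, 0.0],  # histone mark (ignored by this toy filter)
    [1.0, 1.0],  # accessibility at both positions
]
scores = conv1d_relu(x, kernel, bias=-2.0)
print(scores)
```

In this toy input, the filter fires only at the window covering the "CG" dinucleotide where the accessibility signal is non-zero, which shows how a single filter can draw on sequence and epigenomic channels at once. In a trained model the filter weights would be learned rather than set by hand.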

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Binding Sites / genetics*
  • Chromatin Immunoprecipitation Sequencing
  • Computational Biology / methods*
  • Deep Learning*
  • Epigenesis, Genetic / genetics*
  • Histone Code / genetics
  • Protein Binding
  • Transcription Factors / chemistry
  • Transcription Factors / genetics*
  • Transcription Factors / metabolism

Substances

  • Transcription Factors