Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

PLoS One. 2020 Jul 23;15(7):e0235748. doi: 10.1371/journal.pone.0235748. eCollection 2020.

Abstract

With advances in sequencing technology, a vast amount of genomic sequence information has become available. However, annotating biological functions, particularly those of non-protein-coding regions, in genome sequences without experiments remains a challenging task. Recently, deep learning-based methods have been shown to predict gene regulatory regions from genome sequences, promising to aid the interpretation of genomic sequence data. Here, we report improved prediction accuracy for gene regulatory regions, achieved with a convolution-layer design that efficiently processes genomic sequence information, and we present DeepGMAP, software for training and comparing different deep learning-based models (https://github.com/koonimaru/DeepGMAP). First, we demonstrate that our convolution layers, termed forward- and reverse-sequence scan (FRSS) layers, integrate both forward and reverse strand information and enhance the power to predict gene regulatory regions. Second, we assess previous studies and identify problems associated with data structures that caused overfitting. Finally, we introduce visualization methods to examine what the program learned. Together, our FRSS layers improve the prediction accuracy for gene regulatory regions.
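The idea of scanning both strands with one set of filters can be illustrated with a small NumPy sketch. This is not the DeepGMAP implementation: the function names are hypothetical, the A, C, G, T channel order is an assumption, and summing the two scans is only one plausible way to integrate the strands, since the abstract does not say how the FRSS layers combine them.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; complement pairs (A/T, C/G) mirror each other


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases such as N stay all-zero."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            arr[i, j] = 1.0
    return arr


def scan(x, kernel):
    """Valid cross-correlation of a (L, 4) sequence with a (k, 4) kernel."""
    k = kernel.shape[0]
    n = x.shape[0] - k + 1
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(n)], dtype=np.float32)


def reverse_complement(arr):
    """Reverse-complement a one-hot sequence or kernel.

    With channels ordered A, C, G, T, flipping the position axis reverses the
    sequence and flipping the channel axis swaps A<->T and C<->G.
    """
    return arr[::-1, ::-1]


def two_strand_scan(x, kernel):
    """Scan with a kernel and its reverse complement, then sum (an assumed combination rule)."""
    forward = scan(x, kernel)
    reverse = scan(x, reverse_complement(kernel))
    return forward + reverse


# Hypothetical motif and sequence, for illustration only.
motif = one_hot("AAC")
scores = two_strand_scan(one_hot("AACGTT"), motif)
```

Because the filter and its reverse complement share the same weights, the scores for a sequence and for its reverse complement are mirror images of each other, so a motif is detected regardless of which strand it lies on.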

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • DNA / analysis*
  • DNA / genetics
  • Genome*
  • Genomics / methods*
  • Humans
  • Mice
  • Neural Networks, Computer*
  • Regulatory Sequences, Nucleic Acid*
  • Sequence Analysis, DNA / methods*
  • Software*

Substances

  • DNA

Associated data

  • figshare/10.6084/m9.figshare.6728348

Grants and funding

This work was supported in part by JSPS KAKENHI grant number 17K15132 to KO, a Special Postdoctoral Researcher Program of RIKEN to KO, and a research grant from MEXT to the RIKEN Center for Life Science Technologies and RIKEN Center for Biosystems Dynamics Research.