Improving the Quantification of DNA Sequences Using Evolutionary Information Based on Deep Learning

Cells. 2019 Dec 14;8(12):1635. doi: 10.3390/cells8121635.

Abstract

Over 98% of the human genome is non-coding, and 93% of disease-associated variants are located in these regions. Understanding the function of these regions is therefore important, but the task is challenging because most of them remain poorly characterized. In this paper, we introduce DQDNN, a novel computational model based on deep neural networks for quantifying the function of non-coding DNA regions. The model combines convolution layers, which capture regulatory motifs at multiple scales, with recurrent layers, which capture long-term dependencies between the captured motifs. In addition, we show that integrating evolutionary information with raw genomic sequences significantly improves the performance of the predictor. The proposed model outperforms state-of-the-art models both when using raw genomic sequences only and when integrating evolutionary information with raw genomic sequences. More specifically, the proposed model improves 96.9% and 98% of the targets in terms of area under the receiver operating characteristic curve and area under the precision-recall curve, respectively. The proposed model also improves the prioritization of functional variants of expression quantitative trait loci (eQTLs) compared with state-of-the-art models.
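As a rough illustration of what integrating evolutionary information with a raw sequence could look like at the input level, the sketch below appends one conservation score per position to a one-hot encoding of the bases. The abstract does not specify the encoding, so the function name `encode_with_conservation`, the five-channel layout, and the example scores are all assumptions made for illustration, not the authors' method.

```python
# Minimal sketch (assumed layout): 4 one-hot base channels + 1 conservation channel.
# The helper name and the example scores are hypothetical, not taken from the paper.
BASES = "ACGT"


def encode_with_conservation(sequence, conservation):
    """Return one feature row per position: [A, C, G, T, conservation]."""
    if len(sequence) != len(conservation):
        raise ValueError("sequence and conservation must have the same length")
    rows = []
    for base, score in zip(sequence.upper(), conservation):
        # Bases outside ACGT (for example N) are encoded as all zeros.
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        rows.append(one_hot + [float(score)])
    return rows


# Example: a 5-base sequence with made-up per-position conservation scores.
features = encode_with_conservation("ACGTN", [0.9, 0.1, 0.5, 0.0, 0.3])
```

Each row of `features` could then be fed to a convolutional layer as a five-channel input; a model that uses raw sequence only would simply drop the last channel.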

Keywords: DNA computing; LSTM; convolutional neural network; deep learning; evolutionary information; non-coding DNA.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Biological Evolution
  • Computational Biology / methods*
  • DNA / genetics
  • Deep Learning / trends
  • Evolution, Molecular
  • Genome, Human / genetics*
  • Genomics
  • Humans
  • Neural Networks, Computer
  • ROC Curve
  • Sequence Analysis, DNA / methods*

Substances

  • DNA