Deep learning of the back-splicing code for circular RNA formation

Bioinformatics. 2019 Dec 15;35(24):5235-5242. doi: 10.1093/bioinformatics/btz382.

Abstract

Motivation: Circular RNAs (circRNAs) are a new class of endogenous RNAs in animals and plants. During pre-RNA splicing, the 5' and 3' termini of exon(s) can be covalently ligated to form circRNAs through back-splicing (head-to-tail splicing). CircRNAs can be conserved across species, show tissue- and developmental stage-specific expression patterns, and may be associated with human disease. However, the mechanism of circRNA formation remains unclear, although some sequence features have been shown to affect back-splicing.

Results: In this study, by applying state-of-the-art machine learning techniques, we have developed the first deep learning model, DeepCirCode, to predict back-splicing for human circRNA formation. DeepCirCode uses a convolutional neural network (CNN) with nucleotide sequence as the input, and outperforms conventional machine learning algorithms such as support vector machines and random forests. Relevant features learnt by DeepCirCode are represented as sequence motifs, some of which match known human motifs involved in RNA splicing, transcription or translation. Analysis of these motifs shows that their distribution in RNA sequences can be important for back-splicing. Moreover, some of the human motifs appear to be conserved in mouse and fruit fly. The findings provide new insight into the back-splicing code for circRNA formation.
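
To make the idea of a CNN reading nucleotide sequence and learning motif-like features more concrete, the following is a minimal sketch and not the DeepCirCode implementation. It assumes one-hot encoding of the sequence (a common choice, not stated in the abstract) and uses a hand-written, hypothetical two-base kernel in place of learned weights. The names one_hot, conv1d_scan and gt_kernel are illustrative only.

```python
# Illustrative sketch only: not taken from the DeepCirCode source code.
NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a sequence as rows of 4 values in the order A, C, G, T.

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in seq.upper()]


def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence without padding.

    Returns one ReLU-activated score per window position, which is what
    a single convolutional filter computes before any pooling.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(max(total, 0.0))  # ReLU
    return scores


# Hypothetical kernel that responds most strongly to the dinucleotide "GT".
# In a trained CNN these weights would be learned, not written by hand.
gt_kernel = one_hot("GT")

seq = "ACGTAGGTAA"
activations = conv1d_scan(one_hot(seq), gt_kernel)

# Index of the first window with the highest activation.
best = max(range(len(activations)), key=activations.__getitem__)
```

Reading a filter's weights as a position weight matrix, or collecting the windows that activate it most strongly, is one general way such filters can be interpreted as sequence motifs. How the authors derived their motifs is described in the full paper, not in this abstract.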

Availability and implementation: All the datasets and source code for model construction are available at https://github.com/BioDataLearning/DeepCirCode.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Deep Learning*
  • Mice
  • RNA Splicing
  • RNA, Circular
  • Sequence Analysis, RNA

Substances

  • RNA, Circular