Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting

Bioinformatics. 2020 Nov 1;36(17):4576-4582. doi: 10.1093/bioinformatics/btaa534.

Abstract

Motivation: RNA secondary structure plays a vital role in fundamental cellular processes, and identifying it is a key step toward understanding RNA function. Recently, a few experimental methods have been developed to profile genome-wide RNA secondary structure, i.e. the pairing probability of each nucleotide, through high-throughput sequencing techniques. However, these high-throughput methods have low precision and cannot cover all nucleotides due to limited sequencing coverage.

Results: Here, we have developed a new method, based on the extreme gradient boosting technique, for predicting the genome-wide RNA secondary structure profile from RNA sequence. The method achieves areas under the receiver operating characteristic curve (AUC) >0.9 on three different datasets, and an AUC of 0.888 in an independent test on the recently released Zika virus data. These AUCs are consistently >5% greater than those achieved by CROSS, a recently developed method based on a shallow neural network. Further analysis of the 1000 Genomes Project data showed that our predicted unpaired probabilities are highly correlated (>0.8) with the minor allele frequencies at synonymous mutations, non-synonymous mutations, and mutations in untranslated regions; these correlations were higher than those obtained with RNAplfold. Moreover, predictions over all human mRNAs were consistent with the previous observation that unpaired probability is periodically distributed across codons. The accurate predictions by our method indicate that such a model, trained on genome-wide experimental data, might be an alternative to analytical methods.
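As an illustration of the kind of per-nucleotide evaluation described above, the following is a minimal sketch, not the GRASP implementation. It assumes a simple one-hot encoding of a fixed sequence window around each nucleotide (the abstract does not state which features are used, so the function `one_hot_window` and the `flank` parameter are hypothetical), and it computes AUC directly from its definition as the probability that a randomly chosen positive nucleotide receives a higher score than a randomly chosen negative one.

```python
from typing import List, Sequence

# Assumption: uppercase RNA alphabet; any other symbol encodes as all zeros.
NUCLEOTIDES = "ACGU"


def one_hot_window(seq: str, center: int, flank: int = 2) -> List[int]:
    """One-hot encode the window of 2*flank+1 nucleotides around `center`.

    Hypothetical featurization for illustration only. Positions that fall
    outside the sequence are padded with four zeros.
    """
    features: List[int] = []
    for pos in range(center - flank, center + flank + 1):
        base = seq[pos] if 0 <= pos < len(seq) else None
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve from binary labels (1/0) and scores.

    Counts, over all positive/negative pairs, how often the positive
    scores higher; ties count as half. Quadratic in the number of
    nucleotides, so suitable only for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In such a setup, the feature vectors would be passed to a gradient boosting classifier, and `roc_auc` would compare its predicted probabilities against the experimentally derived labels for held-out nucleotides.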

Availability and implementation: GRASP is available for academic use at https://github.com/sysu-yanglab/GRASP.

Supplementary information: Supplementary data are available online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Base Sequence
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Neural Networks, Computer
  • RNA / genetics
  • Software
  • Zika Virus Infection*
  • Zika Virus*

Substances

  • RNA