MiRmat: mature microRNA sequence prediction

PLoS One. 2012;7(12):e51673. doi: 10.1371/journal.pone.0051673. Epub 2012 Dec 27.

Abstract

Background: MicroRNAs are generated from primary transcripts mainly through sequential cleavage by two enzymes, Drosha and Dicer. The sequence of a mature microRNA, especially the 'seed sequence', largely determines its binding ability and specificity for target mRNAs. Therefore, methods that predict mature microRNA sequences with high accuracy will benefit the identification and characterization of novel microRNAs and their targets, and contribute to inferring the post-transcriptional regulatory network at a genome scale.

Methodology/principal findings: We have developed a method, MiRmat, to predict the mature microRNA sequence. MiRmat consists of two parts: prediction of the Drosha processing site and identification of the Dicer processing site. Based on an analysis of microRNAs from 12 species, we found that the patterns of free energy profiles are conserved among vertebrate microRNA hairpins. We therefore combined the free energy distribution pattern of the downstream part of the pri-microRNA secondary structure with the Random Forest algorithm to predict the mature microRNA sequence. In an evaluation on an independent test dataset from 10 vertebrates, MiRmat identified 77.8% of the Drosha processing sites and 92.8% of the Dicer sites within a deviation of 2 nt. In a more stringent evaluation that excluded microRNAs belonging to the same family in both the training set and the test set, MiRmat maintained good performance, with identification rates of 71.9% and 87.2% for the Drosha and Dicer sites respectively, which indicates that it can handle novel microRNA families. MiRmat outperforms other state-of-the-art methods and is highly effective for predicting mature microRNA sequences in vertebrates.
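
The identification rates quoted above count a processing site as correctly identified when the predicted position lies within 2 nt of the annotated one. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `identification_rate`, the parameter `max_dev`, and the example positions are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def identification_rate(
    predicted: Sequence[int],
    annotated: Sequence[int],
    max_dev: int = 2,
) -> float:
    """Return the fraction of predicted sites within max_dev nt of the annotated site."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not predicted:
        raise ValueError("at least one site is required")
    # A site counts as identified when the absolute deviation is <= max_dev.
    hits = sum(1 for p, a in zip(predicted, annotated) if abs(p - a) <= max_dev)
    return hits / len(predicted)


# Hypothetical site positions (nt offsets within each hairpin), not real data.
predicted_sites = [20, 35, 41, 18, 60]
annotated_sites = [21, 35, 44, 16, 60]

rate = identification_rate(predicted_sites, annotated_sites)
print(f"{rate:.1%} of sites identified within 2 nt")
```

In this sketch a deviation of exactly 2 nt counts as a hit; whether the boundary is inclusive in the original evaluation is an assumption.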

Conclusion: MiRmat was developed to identify mature microRNA sequence(s) by combining the free energy distribution of the RNA stem-loop structure with the Random Forest algorithm. We show that MiRmat performs better than existing tools and is applicable across vertebrates. MiRmat is freely available at http://mcube.nju.edu.cn/jwang/lab/soft/MiRmat/.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Animals
  • DEAD-box RNA Helicases / metabolism*
  • Gene Expression Regulation*
  • Humans
  • MicroRNAs / chemistry
  • MicroRNAs / genetics*
  • RNA, Messenger / genetics
  • Ribonuclease III / metabolism*
  • Species Specificity

Substances

  • MicroRNAs
  • RNA, Messenger
  • DICER1 protein, human
  • DROSHA protein, human
  • Ribonuclease III
  • DEAD-box RNA Helicases

Grants and funding

This work was supported by grants from the National Natural Science Foundation of China (30890044, 31071232, 61021062, 61173068), the National Basic Research Program (2007CB814806, 2010CB327903), and the Postdoctoral Science Foundation of China (20090461086). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.