Harnessing NGS and Big Data Optimally: Comparison of miRNA Prediction from Assembled versus Non-assembled Sequencing Data--The Case of the Grass Aegilops tauschii Complex Genome

Hikmet Budak; Melda Kantar

doi:10.1089/omi.2015.0038

Harnessing NGS and Big Data Optimally: Comparison of miRNA Prediction from Assembled versus Non-assembled Sequencing Data--The Case of the Grass Aegilops tauschii Complex Genome

OMICS. 2015 Jul;19(7):407-15. doi: 10.1089/omi.2015.0038. Epub 2015 Jun 10.

Authors

Hikmet Budak¹, Melda Kantar¹

Affiliation

¹ Molecular Biology, Genetics and Bioengineering Program, Faculty of Engineering and Natural Sciences, Sabanci University , Istanbul, Turkey .

PMID: 26061358
DOI: 10.1089/omi.2015.0038

Abstract

MicroRNAs (miRNAs) are small, endogenous, non-coding RNA molecules that regulate gene expression at the post-transcriptional level. As high-throughput next generation sequencing (NGS) and Big Data rapidly accumulate for various species, efforts for in silico identification of miRNAs intensify. Surprisingly, the effect of the input genomics sequence on the robustness of miRNA prediction was not evaluated in detail to date. In the present study, we performed a homology-based miRNA and isomiRNA prediction of the 5D chromosome of bread wheat progenitor, Aegilops tauschii, using two distinct sequence data sets as input: (1) raw sequence reads obtained from 454-GS FLX Titanium sequencing platform and (2) an assembly constructed from these reads. We also compared this method with a number of available plant sequence datasets. We report here the identification of 62 and 22 miRNAs from raw reads and the assembly, respectively, of which 16 were predicted with high confidence from both datasets. While raw reads promoted sensitivity with the high number of miRNAs predicted, 55% (12 out of 22) of the assembly-based predictions were supported by previous observations, bringing specificity forward compared to the read-based predictions, of which only 37% were supported. Importantly, raw reads could identify several repeat-related miRNAs that could not be detected with the assembly. However, raw reads could not capture 6 miRNAs, for which the stem-loops could only be covered by the relatively longer sequences from the assembly. In summary, the comparison of miRNA datasets obtained by these two strategies revealed that utilization of raw reads, as well as assemblies for in silico prediction, have distinct advantages and disadvantages. Consideration of these important nuances can benefit future miRNA identification efforts in the current age of NGS and Big Data driven life sciences innovation.

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Datasets as Topic
Genome, Plant*
High-Throughput Nucleotide Sequencing*
MicroRNAs / chemistry*
Molecular Sequence Annotation
Poaceae / genetics*
Sequence Analysis, RNA

Substances

MicroRNAs