Predicting functional long non-coding RNAs validated by low throughput experiments

RNA Biol. 2019 Nov;16(11):1555-1564. doi: 10.1080/15476286.2019.1644590. Epub 2019 Jul 26.

Abstract

High-throughput techniques have uncovered hundreds of thousands of long non-coding RNAs (lncRNAs). Among them, only a tiny fraction have functions that were experimentally validated by low-throughput methods (EVlncRNAs). What fraction of lncRNAs from high-throughput experiments (HTlncRNAs) is truly functional is an active subject of debate. Here, we developed the first method to distinguish EVlncRNAs from HTlncRNAs and mRNAs by using Support Vector Machines. We found that EVlncRNAs can be well separated from HTlncRNAs and mRNAs, with a Matthews correlation coefficient of 0.6, a sensitivity of 64%, and a precision of 81% on the independent human test set. The most useful features for classification are related to sequence conservation at the RNA level (for separating EVlncRNAs from HTlncRNAs) and at the protein level (for separating them from mRNAs). The method is robust: the model trained on human RNAs is applicable to independent mouse RNAs with similar accuracy and, to a lesser extent, to plant RNAs. The method can recover newly discovered EVlncRNAs with high sensitivity. Its application to 2000 randomly selected human HTlncRNAs indicates that the majority of HTlncRNAs are probably non-functional but that a large portion (nearly 30%) are likely functional. In other words, there is an ample number of lncRNAs whose specific biological roles are yet to be discovered. The method developed here is expected to speed up discovery and reduce its cost by prioritizing potentially functional lncRNAs prior to experimental validation. EVlncRNA-pred is available as a web server at http://biophy.dzu.edu.cn/lncrnapred/index.html. All datasets used in this study can be obtained from the same website.
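For readers less familiar with the three reported metrics, the following is a minimal sketch of how the Matthews correlation coefficient, sensitivity, and precision are computed from a binary confusion matrix. The function name `classification_metrics` and the confusion-matrix counts are hypothetical, chosen only for illustration; they are not the counts from the study's test set.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, precision, and MCC for a binary confusion matrix.

    tp/fn: positives (e.g. EVlncRNAs) predicted correctly / missed.
    tn/fp: negatives (e.g. HTlncRNAs or mRNAs) rejected / wrongly accepted.
    """
    # Sensitivity (recall): fraction of true positives that are recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: fraction of positive predictions that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "mcc": mcc}


# Hypothetical counts for illustration only (NOT taken from the paper).
metrics = classification_metrics(tp=64, fp=15, tn=185, fn=36)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and precision, the MCC uses all four cells of the confusion matrix, which makes it a useful single-number summary when the positive and negative classes differ in size.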

Keywords: Long non-coding RNAs; functional lncRNAs; low throughput experiments; prediction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology / methods*
  • Humans
  • Mice
  • Molecular Sequence Annotation
  • RNA, Long Noncoding / genetics*
  • Sequence Analysis, RNA / methods*
  • Support Vector Machine

Substances

  • RNA, Long Noncoding

Grants and funding

This work was supported by the National Natural Science Foundation of China [61671107, 61271378, 61801081]; Taishan Scholars Program of Shandong Province of China [Tshw201502045]; National Health and Medical Research Council of Australia [1121629 to Y.Z.]; Australian Research Council [DP 180102060 to Y.Z.]; and Talent Introduction Project of Dezhou University of China [320111 to B.Z.].