A systematic evaluation of bioinformatics tools for identification of long noncoding RNAs

RNA. 2021 Jan;27(1):80-98. doi: 10.1261/rna.074724.120. Epub 2020 Oct 14.

Abstract

High-throughput RNA sequencing has unveiled the complexity of the transcriptome and significantly increased the number of recorded long noncoding RNAs (lncRNAs), which have been reported to participate in a variety of biological processes. Identification of lncRNAs is a key step in lncRNA analysis, and many bioinformatics tools have been developed for this purpose in recent years. While these tools allow us to identify lncRNAs more efficiently and accurately, they may produce inconsistent results, which makes choosing among them difficult. We compared the performance of 41 analysis models based on 14 software packages on different data sets, including high-quality and low-quality data from 33 species. In addition, we explored the computational efficiency, robustness, and joint prediction of the models. As practical guidance, we summarize key points for lncRNA identification under different situations. In this investigation, no single model was superior to the others under all test conditions. The performance of a model depended to a great extent on the source of the transcripts and the quality of the assemblies. As general references, FEELnc_all_cl, CPC, and CPAT_mouse work well in most species, while COME, CNCI, and lncScore are good choices for model organisms. Since these tools are sensitive to factors such as the species involved and the quality of the assembly, researchers must carefully select the appropriate tool based on the actual data. Alternatively, our tests suggest that joint prediction can outperform any single model if proper models are chosen. All scripts and data used in this research can be accessed at http://bioinfo.ihb.ac.cn/elit.
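
The abstract does not say how the calls of several models were combined for joint prediction. As an illustration only, the sketch below assumes a simple vote: a transcript is labeled lncRNA when at least a chosen number of tools call it noncoding. The function name `joint_predict`, the `min_votes` parameter, the transcript IDs, and the per-tool labels are all hypothetical. The tool names are used only as dictionary keys, and none of the values are results from the paper.

```python
def joint_predict(calls, min_votes=None):
    """Combine per-tool coding/noncoding calls by voting.

    calls: dict mapping tool name -> dict mapping transcript ID -> label,
           where each label is "lncRNA" or "coding".
    min_votes: number of "lncRNA" calls required to label a transcript
               as lncRNA; defaults to a strict majority of the tools.
    Only transcripts scored by every tool are returned.
    """
    tools = list(calls)
    if not tools:
        return {}
    if min_votes is None:
        min_votes = len(tools) // 2 + 1  # strict majority

    # Restrict to transcripts that every tool produced a call for.
    shared = set.intersection(*(set(calls[t]) for t in tools))

    result = {}
    for tid in sorted(shared):
        votes = sum(1 for t in tools if calls[t][tid] == "lncRNA")
        result[tid] = "lncRNA" if votes >= min_votes else "coding"
    return result


# Made-up calls for three transcripts (not data from the study).
example_calls = {
    "CPC": {"tx1": "lncRNA", "tx2": "coding", "tx3": "lncRNA"},
    "CPAT_mouse": {"tx1": "lncRNA", "tx2": "coding", "tx3": "coding"},
    "FEELnc_all_cl": {"tx1": "coding", "tx2": "coding", "tx3": "coding"},
}

consensus = joint_predict(example_calls)
for transcript_id, label in consensus.items():
    print(transcript_id, label)
```

Setting `min_votes=1` turns the vote into a union of the tools' lncRNA calls (more sensitive), while setting it to the number of tools gives an intersection (more specific). Which threshold and which combination of models is appropriate would depend on the species and assembly quality, as the abstract stresses.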

Keywords: joint prediction; long noncoding RNA identification; non-model species; simulated and biological data sets; tools comparison.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Benchmarking
  • Computational Biology / methods*
  • Datasets as Topic
  • Genome*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Mice
  • Models, Genetic
  • Molecular Sequence Annotation
  • Plants / genetics
  • RNA, Long Noncoding / classification
  • RNA, Long Noncoding / genetics*
  • RNA, Long Noncoding / metabolism
  • RNA, Messenger / classification
  • RNA, Messenger / genetics*
  • RNA, Messenger / metabolism
  • Software*
  • Species Specificity
  • Transcriptome

Substances

  • RNA, Long Noncoding
  • RNA, Messenger