LncRNA-ID: Long non-coding RNA IDentification using balanced random forests

Rujira Achawanantakun; Jiao Chen; Yanni Sun; Yuan Zhang

doi:10.1093/bioinformatics/btv480

Bioinformatics. 2015 Dec 15;31(24):3897-905. doi: 10.1093/bioinformatics/btv480. Epub 2015 Aug 26.

Authors

Rujira Achawanantakun¹, Jiao Chen¹, Yanni Sun¹, Yuan Zhang¹

Affiliation

¹ Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.

PMID: 26315901
DOI: 10.1093/bioinformatics/btv480

Abstract

Motivation: Long non-coding RNAs (lncRNAs), which are non-coding RNAs longer than 200 nucleotides, play important biological roles, such as regulating gene expression. To fully reveal the functions of lncRNAs, a fundamental step is to annotate them in various species. However, because lncRNAs tend to contain one or more open reading frames, it is not trivial to distinguish these long non-coding transcripts from protein-coding genes in transcriptomic data.

Results: In this work, we design a new tool that calculates the coding potential of a transcript using a machine learning model (random forest) based on multiple features including sequence characteristics of putative open reading frames, translation scores based on ribosomal coverage, and conservation against characterized protein families. The experimental results show that our tool competes favorably with existing coding potential computation tools in lncRNA identification.
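As an illustration of what "sequence characteristics of putative open reading frames" could look like in practice, the following is a minimal sketch, not the authors' implementation. It scans the three forward reading frames of a transcript for the longest ATG-to-stop ORF and derives two simple features. The function names (`longest_orf`, `orf_features`) and the feature names (`orf_length`, `orf_coverage`) are hypothetical and chosen here for illustration; the abstract does not specify the exact feature set used by the tool.

```python
# Illustrative sketch only: the names and features below are assumptions,
# not the feature set defined by LncRNA-ID.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward
    strand (end is exclusive and includes the stop codon), or None."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def orf_features(seq):
    """Compute two simple ORF-derived features for one transcript."""
    orf = longest_orf(seq)
    if orf is None:
        return {"orf_length": 0, "orf_coverage": 0.0}
    length = orf[1] - orf[0]
    return {"orf_length": length, "orf_coverage": length / len(seq)}


if __name__ == "__main__":
    print(orf_features("CCATGAAATAGGG"))
```

Feature vectors of this kind, computed for known coding and non-coding transcripts, could then be passed to a random forest classifier. Because lncRNA and protein-coding training sets are often of unequal size, a class-balanced variant of the forest is one plausible choice, consistent with the "balanced random forests" named in the title.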

Availability and implementation: The scripts and data can be downloaded at https://github.com/zhangy72/LncRNA-ID.

Publication types

Evaluation Study
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
Humans
Machine Learning*
Mice
Open Reading Frames
Proteins / genetics
RNA, Long Noncoding / genetics*
Ribosomes / metabolism
Software*

Substances

Proteins
RNA, Long Noncoding