Discovering misannotated lncRNAs using deep learning training dynamics

Afshan Nabi; Berke Dilekoglu; Ogun Adebali; Oznur Tastan

doi:10.1093/bioinformatics/btac821

Bioinformatics. 2023 Jan 1;39(1):btac821. doi: 10.1093/bioinformatics/btac821.

Authors

Afshan Nabi¹, Berke Dilekoglu¹, Ogun Adebali¹, Oznur Tastan¹

Affiliation

¹ Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey.

Abstract

Motivation: Recent experimental evidence has shown that some long non-coding RNAs (lncRNAs) contain small open reading frames (sORFs) that are translated into functional micropeptides, suggesting that these lncRNAs are misannotated as non-coding. Current methods to detect misannotated lncRNAs rely on ribosome-profiling (Ribo-Seq) and mass-spectrometry experiments, which are cell-type dependent and expensive.

Results: Here, we propose a computational method to identify possible misannotated lncRNAs from sequence information alone. Our approach first builds deep learning models to discriminate coding and non-coding transcripts and then leverages these models' training dynamics to identify misannotated lncRNAs, i.e. lncRNAs with coding potential. The set of misannotated lncRNAs we identified overlaps significantly with experimentally validated ones and closely resembles coding protein sequences, as evidenced by significant BLAST hits. Our analysis of a subset of misannotated lncRNA candidates also shows that some of the ORFs they contain yield high-confidence folded structures as predicted by AlphaFold2. This methodology offers promising potential for assisting experimental efforts in characterizing the hidden proteome encoded by misannotated lncRNAs and for curating better datasets for building coding potential predictors.
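
To make the idea of using training dynamics concrete, the following is a minimal sketch, not the authors' implementation. It assumes that, for each transcript, the probability the model assigns to the annotated label has been recorded at every training epoch, and it summarizes that history with two statistics (mean confidence and variability) that are an assumption here, since the abstract does not name the statistics used. The function names, transcript ids, probabilities, and thresholds are all hypothetical.

```python
from statistics import mean, pstdev


def training_dynamics(prob_history):
    """Summarize per-epoch probabilities of the annotated label.

    prob_history maps a transcript id to the list of probabilities the
    model assigned to that transcript's annotated label, one per epoch.
    """
    return {
        tid: {"confidence": mean(probs), "variability": pstdev(probs)}
        for tid, probs in prob_history.items()
    }


def flag_candidates(prob_history, labels,
                    max_confidence=0.2, max_variability=0.2):
    """Return lncRNA-labeled transcripts the model consistently rejects.

    A transcript annotated as lncRNA whose annotated label receives low
    probability in every epoch (low confidence, low variability) is a
    candidate for misannotation. Both thresholds are illustrative
    assumptions, not values taken from the paper.
    """
    stats = training_dynamics(prob_history)
    return sorted(
        tid for tid, s in stats.items()
        if labels[tid] == "lncRNA"
        and s["confidence"] <= max_confidence
        and s["variability"] <= max_variability
    )


# Hypothetical per-epoch probabilities of the annotated label.
history = {
    "tx_coding": [0.6, 0.8, 0.9, 0.95],
    "tx_lnc_easy": [0.7, 0.85, 0.9, 0.95],
    "tx_lnc_suspect": [0.2, 0.1, 0.1, 0.05],
    "tx_lnc_ambiguous": [0.2, 0.8, 0.3, 0.7],
}
labels = {
    "tx_coding": "mRNA",
    "tx_lnc_easy": "lncRNA",
    "tx_lnc_suspect": "lncRNA",
    "tx_lnc_ambiguous": "lncRNA",
}

candidates = flag_candidates(history, labels)
print(candidates)
```

In this toy input only `tx_lnc_suspect` is flagged: the model never comes to agree with its non-coding label, whereas `tx_lnc_ambiguous` fluctuates between epochs and is therefore left out under these assumed thresholds.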

Availability and implementation: Source code is available at https://github.com/nabiafshan/DetectingMisannotatedLncRNAs.

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Amino Acid Sequence
Deep Learning*
Micropeptides
Open Reading Frames
Proteome / genetics
RNA, Long Noncoding* / genetics

Substances

RNA, Long Noncoding
Proteome