Measuring functional similarity of lncRNAs based on variable K-mer profiles of nucleotide sequences

Methods. 2023 Apr:212:21-30. doi: 10.1016/j.ymeth.2023.02.009. Epub 2023 Feb 20.

Abstract

Long non-coding RNAs (lncRNAs) are a class of essential non-coding RNAs longer than 200 nt. Recent studies indicate that lncRNAs have various complex regulatory functions that strongly affect many fundamental biological processes. Because measuring the functional similarity between lncRNAs with traditional wet-lab experiments is time-consuming and labor-intensive, computational approaches have become an effective alternative. However, most sequence-based computational methods measure the functional similarity of lncRNAs using fixed-length vector representations, which cannot capture the features of larger k-mers. It is therefore important to improve the prediction of the potential regulatory functions of lncRNAs. In this study, we propose a novel approach called MFSLNC that comprehensively measures the functional similarity of lncRNAs based on variable k-mer profiles of nucleotide sequences. MFSLNC stores k-mers in a dictionary tree, which allows lncRNAs to be represented comprehensively with long k-mers. The functional similarity between lncRNAs is evaluated with the Jaccard similarity. MFSLNC confirmed the similarity between lncRNAs that share the same mechanism and detected homologous sequence pairs between human and mouse. In addition, MFSLNC was applied to lncRNA-disease association prediction in combination with the association prediction model WKNKN. Moreover, by comparing it with classical methods based on lncRNA-mRNA association data, we also show that our method calculates the similarity of lncRNAs more effectively. The prediction AUC is 0.867, which compares well with similar models.
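
For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal illustrative sketch of storing variable-length k-mers in a dictionary tree (trie) and comparing two sequences by Jaccard similarity. It is not the MFSLNC implementation: the names KmerTrie, kmer_similarity, min_k and max_k are hypothetical, the k range is an assumption (the abstract does not state one), and an unweighted set-based Jaccard index is assumed.

```python
from typing import Dict, Set


class KmerTrie:
    """Dictionary tree in which every node at depth d spells one d-mer."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add_sequence(self, seq: str, max_k: int) -> None:
        # From each start position, walk at most max_k bases down the trie.
        # Every node visited on the way represents a k-mer of that depth,
        # so all k-mers of length 1..max_k are stored in a single pass.
        for start in range(len(seq)):
            node = self.root
            for base in seq[start:start + max_k]:
                node = node.setdefault(base, {})

    def kmers(self, min_k: int = 1) -> Set[str]:
        # Depth-first traversal collecting every stored k-mer of length >= min_k.
        found: Set[str] = set()
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if prefix and len(prefix) >= min_k:
                found.add(prefix)
            for base, child in node.items():
                stack.append((prefix + base, child))
        return found


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def kmer_similarity(seq_a: str, seq_b: str, min_k: int = 2, max_k: int = 3) -> float:
    """Jaccard similarity of the variable-length k-mer sets of two sequences."""
    trie_a, trie_b = KmerTrie(), KmerTrie()
    trie_a.add_sequence(seq_a, max_k)
    trie_b.add_sequence(seq_b, max_k)
    return jaccard(trie_a.kmers(min_k), trie_b.kmers(min_k))
```

As a worked case under these assumptions, kmer_similarity("ACGU", "ACGA", min_k=2, max_k=3) compares {AC, CG, GU, ACG, CGU} with {AC, CG, GA, ACG, CGA}; three k-mers are shared out of seven distinct ones, giving 3/7. Sharing prefixes in the trie is what keeps long k-mers affordable compared with a fixed-length vector of all 4^k possible k-mers.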

Keywords: K-mer profiles; RNA binding sites; lncRNA functional similarity.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Base Sequence
  • Computational Biology / methods
  • Humans
  • Mice
  • RNA, Long Noncoding* / genetics

Substances

  • RNA, Long Noncoding