Identifying N7-methylguanosine sites by integrating multiple features

Biopolymers. 2022 Feb;113(2):e23480. doi: 10.1002/bip.23480. Epub 2021 Oct 28.

Abstract

Recent studies have reported that N7-methylguanosine (m7G) plays a vital role in the regulation of gene expression. Determining the distribution of m7G is therefore a crucial step towards understanding its biological functions. Although experimental approaches can locate m7G sites accurately, they are labor-intensive, costly, and time-consuming. It is therefore necessary to develop more effective and robust computational methods to replace, or at least complement, current experimental methods. In this study, we developed a novel sequence-based computational tool to identify RNA m7G sites. In this model, 22 kinds of dinucleotide physicochemical (PC) properties were employed to encode the RNA sequence as a PC matrix. Three types of descriptors (auto-covariance, cross-covariance, and discrete wavelet transform) were adopted to extract effective features from the PC matrix. The least absolute shrinkage and selection operator (LASSO) algorithm was used to reduce the influence of irrelevant or redundant features. Finally, the selected features were fed into a support vector machine (SVM) to distinguish m7G sites from non-m7G sites. The proposed method significantly outperforms existing predictors across all evaluation metrics, which indicates that the approach is effective in identifying RNA m7G sites.
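
The encoding and covariance descriptors described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the 22 property values, the lag range, or the exact normalisation, so the property table below is a made-up placeholder (two toy properties), the function names are hypothetical, and the covariance formulas follow one common textbook formulation. The discrete wavelet transform, LASSO selection, and SVM steps are left out.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def toy_property_table(n_props=2):
    """Placeholder values only; NOT the 22 real PC properties used in the paper."""
    return {
        dn: [((idx * (k + 3)) % 7) / 7.0 for k in range(n_props)]
        for idx, dn in enumerate(DINUCLEOTIDES)
    }


def encode(seq, table):
    """Return a matrix with one row per property and one column per dinucleotide."""
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n_props = len(next(iter(table.values())))
    return [[table[dn][k] for dn in dinucs] for k in range(n_props)]


def cross_covariance(row_a, row_b, lag):
    """Mean product of centred values of row_a and of row_b shifted by `lag`."""
    n = len(row_a)
    mean_a = sum(row_a) / n
    mean_b = sum(row_b) / n
    total = sum(
        (row_a[i] - mean_a) * (row_b[i + lag] - mean_b) for i in range(n - lag)
    )
    return total / (n - lag)


def auto_covariance(row, lag):
    """Auto-covariance is the cross-covariance of a property with itself."""
    return cross_covariance(row, row, lag)


def acc_features(matrix, max_lag):
    """Concatenate auto-covariance, then cross-covariance, for lags 1..max_lag."""
    feats = []
    for lag in range(1, max_lag + 1):
        for row in matrix:
            feats.append(auto_covariance(row, lag))
    for lag in range(1, max_lag + 1):
        for a, row_a in enumerate(matrix):
            for b, row_b in enumerate(matrix):
                if a != b:
                    feats.append(cross_covariance(row_a, row_b, lag))
    return feats


# Usage: an arbitrary example sequence, not taken from the paper's dataset.
TOY_TABLE = toy_property_table()
pc_matrix = encode("GGACUGAUCGA", TOY_TABLE)
features = acc_features(pc_matrix, max_lag=3)
```

With P properties and a maximum lag of L, this layout gives P × L auto-covariance values followed by P × (P − 1) × L cross-covariance values, so every sequence maps to a fixed-length vector regardless of its length. A vector of that kind is what a feature selector and classifier would then consume.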

Keywords: LASSO; SVM; dinucleotide physicochemical properties; m7G sites.

MeSH terms

  • Algorithms
  • Guanosine* / analogs & derivatives
  • Guanosine* / genetics
  • RNA / chemistry
  • Support Vector Machine*

Substances

  • 7-methylguanosine
  • Guanosine
  • RNA