A Linear Regression Predictor for Identifying N6-Methyladenosine Sites Using Frequent Gapped K-mer Pattern

Y Y Zhuang; H J Liu; X Song; Y Ju; H Peng

doi:10.1016/j.omtn.2019.10.001

Mol Ther Nucleic Acids. 2019 Dec 6:18:673-680. doi: 10.1016/j.omtn.2019.10.001. Epub 2019 Oct 10.

Authors

Y Y Zhuang¹, H J Liu², X Song³, Y Ju¹, H Peng¹

Affiliations

¹ School of Informatics, Xiamen University, Xiamen 361005, China.
² College of Information Technology and Computer Science, University of the Cordilleras, Baguio 2600, Philippines.
³ School of Computer and Information Technology, Nanyang Normal University, Nanyang 473000, China. Electronic address: sxyoland@foxmail.com.

Abstract

N6-methyladenosine (m⁶A) is one of the most common and abundant RNA modifications and is involved in many biological processes in humans. Abnormal RNA modifications are often associated with diseases, including tumors, neurogenic diseases, and embryonic retardation. Identifying m⁶A sites is therefore of paramount importance in the post-genomic age. Although many lab-based methods have been proposed to annotate m⁶A sites, they are time-consuming and costly. Given the drawbacks of these methods in RNA sequence recognition, computational methods have been suggested as a complement for identifying m⁶A sites. In this study, we develop a novel feature extraction algorithm based on the frequent gapped k-mer pattern (FGKP) and apply linear regression to construct the prediction model. The new predictor is used to identify m⁶A sites in the Saccharomyces cerevisiae database. In 10-fold cross-validation, it performs better than recent methods. Comparative results indicate that our model has great potential to become a useful and effective tool for genome analysis and to provide further insight into locating m⁶A sites.

Keywords: 10-fold cross-validation; N6-methyladenosine; RNA modifications; Saccharomyces cerevisiae database; frequent gapped k-mer pattern; genome analysis; linear regression; novel feature extraction algorithm.
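
The abstract names two ingredients, frequent gapped k-mer features and a linear regression model, without giving their exact definitions. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: a "gapped pattern" is taken to be two nucleotides separated by a fixed number of skipped positions, "frequent" means present in at least a chosen fraction of the sequences, the small ridge term is added only so the least-squares system stays solvable, the 0.5 decision threshold is arbitrary, and the four toy sequences and their labels are invented.

```python
from collections import Counter

def gapped_patterns(seq, max_gap):
    """Count patterns (X, gap, Y): nucleotide X, `gap` skipped positions, nucleotide Y.
    Assumed definition for illustration; the paper's FGKP may differ."""
    counts = Counter()
    for gap in range(max_gap + 1):
        for i in range(len(seq) - gap - 1):
            counts[(seq[i], gap, seq[i + gap + 1])] += 1
    return counts

def frequent_patterns(seqs, max_gap, min_support):
    """Keep patterns that occur in at least `min_support` (a fraction) of the sequences."""
    support = Counter()
    for s in seqs:
        support.update(gapped_patterns(s, max_gap).keys())
    return sorted(p for p, n in support.items() if n / len(seqs) >= min_support)

def featurize(seq, patterns, max_gap):
    """Intercept term followed by the count of each frequent pattern in `seq`."""
    counts = gapped_patterns(seq, max_gap)
    return [1.0] + [float(counts[p]) for p in patterns]

def fit_linear(X, y, ridge=1e-6):
    """Least squares via the normal equations (X^T X + ridge*I) w = X^T y,
    solved by Gauss-Jordan elimination. The ridge term is an assumption
    that keeps the system non-singular when features are redundant."""
    d = len(X[0])
    A = []
    for i in range(d):
        row = [sum(r[i] * r[j] for r in X) for j in range(d)]
        row[i] += ridge
        row.append(sum(r[i] * t for r, t in zip(X, y)))
        A.append(row)
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(d):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][d] / A[i][i] for i in range(d)]

def score(w, x):
    """Linear regression output for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, x, threshold=0.5):
    """Call a site m6A-positive when the regression score reaches the threshold (assumed 0.5)."""
    return score(w, x) >= threshold

# Invented toy data: 1 = m6A site, 0 = non-site.
seqs = ["GGACU", "AGACA", "UUUCC", "CCUUG"]
labels = [1, 1, 0, 0]
MAX_GAP, MIN_SUPPORT = 1, 0.5

patterns = frequent_patterns(seqs, MAX_GAP, MIN_SUPPORT)
X = [featurize(s, patterns, MAX_GAP) for s in seqs]
w = fit_linear(X, labels)
predictions = [predict(w, x) for x in X]
```

In practice the model would be evaluated on held-out folds (the abstract reports 10-fold cross-validation) rather than on the training sequences as in this toy run.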