PseU-KeMRF: A novel method for identifying RNA pseudouridine sites

IEEE/ACM Trans Comput Biol Bioinform. 2024 Apr 16:PP. doi: 10.1109/TCBB.2024.3389094. Online ahead of print.

Abstract

Pseudouridine is a type of abundant RNA modification that is seen in many different animals and is crucial for a variety of biological functions. Accurately identifying pseudouridine sites within the RNA sequence is vital for the subsequent study of various biological mechanisms of pseudouridine. However, the use of traditional experimental methods faces certain challenges. The development of fast and convenient computational methods is necessary to accurately identify pseudouridine sites from RNA sequence information. To address this, we introduce a novel pseudouridine site prediction model called PseU-KeMRF, which can identify pseudouridine sites in three species, H. sapiens, S. cerevisiae, and M. musculus. Through comprehensive analysis, we selected four RNA coding schemes, including binary feature, position-specific trinucleotide propensity based on single strand (PSTNPss), nucleotide chemical property (NCP) and pseudo k-tuple composition (PseKNC). Then the support vector machine-recursive feature elimination (SVM-RFE) method was used for feature selection and the feature subset was optimized. Finally, the best feature subsets are input into the kernel based on multinomial random forests (KeMRF) classifier for cross-validation and independent testing. As a new classification method, compared with the traditional random forest, KeMRF not only improves the node splitting process of decision tree construction based on multinomial distribution, but also combines the easy to interpret kernel method for prediction, which makes the classification performance better. Our results indicate superior predictive performance of PseU-KeMRF over other existing models. On the three species' training datasets, the testing accuracy of PseU-KeMRF was 0.66%, 3.66%, and 2.76% higher, respectively, than the best available methods. Moreover, PseU-KeMRF's accuracy on independent testing datasets was 15.15% and 11.0% higher, respectively, than the best available methods. The above results can prove that PseU-KeMRF is a highly competitive predictive model that can successfully identify pseudouridine sites in RNA sequences.