Penguin: A tool for predicting pseudouridine sites in direct RNA nanopore sequencing data

Doaa Hassan; Daniel Acevedo; Swapna Vidhur Daulatabad; Quoseena Mir; Sarath Chandra Janga

doi:10.1016/j.ymeth.2022.02.005

Methods. 2022 Jul:203:478-487. doi: 10.1016/j.ymeth.2022.02.005. Epub 2022 Feb 16.

Authors

Doaa Hassan¹, Daniel Acevedo², Swapna Vidhur Daulatabad³, Quoseena Mir³, Sarath Chandra Janga⁴

Affiliations

¹ Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, 535 West Michigan Street, Indianapolis, IN 46202, United States; Computers and Systems Department, National Telecommunication Institute, Cairo, Egypt.
² Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, 535 West Michigan Street, Indianapolis, IN 46202, United States; Computer Science Department, University of Texas Rio Grande Valley, United States.
³ Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, 535 West Michigan Street, Indianapolis, IN 46202, United States.
⁴ Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, 535 West Michigan Street, Indianapolis, IN 46202, United States; Department of Medical and Molecular Genetics, Indiana University School of Medicine, Medical Research and Library Building, 975 West Walnut Street, Indianapolis, IN 46202, United States; Centre for Computational Biology and Bioinformatics, Indiana University School of Medicine, 5021 Health Information and Translational Sciences (HITS), 410 West 10th Street, Indianapolis, IN 46202, United States. Electronic address: scjanga@iupui.edu.

Abstract

Pseudouridine (Ψ) is one of the most abundant RNA modifications, formed when pseudouridine synthase proteins catalyze the isomerization of uridine. It plays an important role in many biological processes and has been reported to have applications in drug development. Recently, single-molecule sequencing techniques such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies have enabled direct detection of RNA modifications on the molecule being sequenced. In this study, we introduce a tool called Penguin that integrates several machine learning (ML) models to identify RNA pseudouridine sites on Nanopore direct RNA sequencing reads. Pseudouridine sites were identified in single-molecule data from direct RNA sequencing, which yielded 723 K reads for the HEK293 cell line and 500 K reads for the HeLa cell line. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore device and from the corresponding basecalled k-mer. These features are used to train the predictors included in Penguin, which in the testing phase predict whether a signal is modified by the presence of a pseudouridine site. Penguin includes several predictors: support vector machine (SVM), random forest (RF), and neural network (NN). Results on the two benchmark datasets, for the HEK293 and HeLa cell lines, show outstanding performance of Penguin in both random split testing and independent validation testing. In random split testing, Penguin identifies pseudouridine sites with an accuracy of 93.38% when SVM is applied to the HEK293 benchmark dataset. In independent validation testing, Penguin achieves an accuracy of 92.61% when SVM is trained on the HEK293 benchmark dataset and tested on the HeLa benchmark dataset. In independent validation testing, Penguin's accuracy is thus 16% higher than that of the existing pseudouridine predictors in the literature.
Employing Penguin to predict pseudouridine sites revealed a significant enrichment of "regulation of mRNA 3'-end processing" in the HEK293 cell line and of "positive regulation of transcription from RNA polymerase II promoter involved in cellular response to chemical stimulus" in the HeLa cell line. Penguin software and models are available on GitHub at https://github.com/Janga-Lab/Penguin and can be readily employed for predicting Ψ sites from Nanopore direct RNA sequencing datasets.
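
The workflow summarized in the abstract (features extracted from the raw signal and the basecalled k-mer, then random split testing and independent validation testing) can be sketched as follows. This is a minimal illustration under stated assumptions, not Penguin's implementation. The abstract does not list the features, so the summary statistics and the one-hot k-mer encoding here are assumed. A simple nearest-centroid classifier stands in for the SVM, RF, and NN predictors so that the example needs only the Python standard library. All function names and the synthetic data are hypothetical.

```python
import random
import statistics

BASES = "ACGU"  # assumed RNA alphabet for the basecalled k-mer


def extract_features(signal, kmer):
    """Assumed feature set: signal summary statistics plus a one-hot k-mer."""
    stats = [
        statistics.fmean(signal),
        statistics.pstdev(signal),
        statistics.median(signal),
        float(len(signal)),
    ]
    one_hot = [1.0 if base == b else 0.0 for base in kmer for b in BASES]
    return stats + one_hot


class NearestCentroid:
    """Stand-in classifier (not the SVM/RF/NN models used in the paper)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab]))
            for x in X
        ]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def random_split_accuracy(X, y, test_fraction=0.2, seed=0):
    """Random split testing: train and test on disjoint parts of one dataset."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    model = NearestCentroid().fit([X[i] for i in train], [y[i] for i in train])
    return accuracy([y[i] for i in test], model.predict([X[i] for i in test]))


def independent_validation_accuracy(X_train, y_train, X_test, y_test):
    """Independent validation: train on one dataset, test on another."""
    model = NearestCentroid().fit(X_train, y_train)
    return accuracy(y_test, model.predict(X_test))


def synthetic_dataset(n, seed):
    """Toy data: label 1 (modified) has a shifted mean signal level."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        level = 100.0 if label else 80.0
        signal = [rng.gauss(level, 2.0) for _ in range(20)]
        X.append(extract_features(signal, "GGUCA"))
        y.append(label)
    return X, y


if __name__ == "__main__":
    X_a, y_a = synthetic_dataset(200, seed=1)  # plays the role of one cell line
    X_b, y_b = synthetic_dataset(200, seed=2)  # plays the role of the other
    print("random split:", random_split_accuracy(X_a, y_a))
    print("independent validation:",
          independent_validation_accuracy(X_a, y_a, X_b, y_b))
```

In this sketch the two synthetic datasets play the roles of the two cell-line benchmarks. Accuracy on toy data says nothing about the figures reported in the abstract.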

Keywords: Nanopore; Pseudouridine; RNA modifications.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
HEK293 Cells
HeLa Cells
High-Throughput Nucleotide Sequencing
Humans
Nanopore Sequencing*
Nanopores*
Pseudouridine / chemistry
RNA / genetics
Sequence Analysis, RNA / methods
Spheniscidae* / genetics
Spheniscidae* / metabolism

Substances

Pseudouridine
RNA

Grants and funding

R01 GM123314/GM/NIGMS NIH HHS/United States