Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning

Tony Jha; Jovinna Mendel; Hyuk Cho; Madhusudan Choudhary

doi:10.1177/11779322221118335

Bioinform Biol Insights. 2022 Aug 18;16:11779322221118335. doi: 10.1177/11779322221118335. eCollection 2022.

Authors

Tony Jha¹, Jovinna Mendel², Hyuk Cho³, Madhusudan Choudhary²

Affiliations

¹ Department of Mathematics, University of California, Berkeley, Berkeley, CA, USA.
² Department of Biological Sciences, Sam Houston State University, Huntsville, TX, USA.
³ Department of Computer Science, Sam Houston State University, Huntsville, TX, USA.

Abstract

Small ribonucleic acid (sRNA) sequences are noncoding RNA (ncRNA) sequences, 50 to 500 nucleotides long, that play an important role in regulating transcription and translation within a bacterial cell. Identifying sRNA sequences within an organism's genome is therefore essential to understanding the impact of these RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we treated sRNA prediction as an imbalanced binary classification problem, in which the minority class of positive sRNAs must be distinguished from the majority class of negatives, and performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in the Salmonella typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA-size distribution with a conformity test for Benford's law. Third, we applied six traditional classification algorithms to the sRNA features and assessed classification performance with seven metrics, varying the positive-to-negative instance ratio and using stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features performs better than with either an individual feature or a single feature group in terms of the area under the precision-recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data across varying imbalance ratios, consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without optimized hyperparameter values, performed better than the other five algorithms with their specific optimal parameter settings. As future work, we plan to extend XGBoost to a larger set of published sRNAs in bacterial genomes and compare its classification performance with that of recent machine learning models.
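
The Benford's law conformity test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which conformity test was used, so the Pearson chi-square goodness-of-fit statistic below is an assumption, and the sequence lengths are hypothetical values chosen for illustration, not data from the study.

```python
import math
from collections import Counter


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for leading digit d in 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


def benford_chi_square(lengths):
    """Pearson chi-square statistic of observed leading digits vs. Benford."""
    counts = Counter(leading_digit(n) for n in lengths)
    total = len(lengths)
    stat = 0.0
    for digit, prob in benford_expected().items():
        expected = prob * total
        stat += (counts.get(digit, 0) - expected) ** 2 / expected
    return stat


# Hypothetical sRNA lengths in nucleotides (illustrative only).
lengths = [52, 68, 75, 88, 93, 109, 120, 134, 147, 158,
           171, 186, 204, 231, 262, 305, 347, 412, 468, 495]

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CRITICAL_5PCT_DF8 = 15.507

stat = benford_chi_square(lengths)
print(f"chi-square = {stat:.2f}, below 5% critical value: {stat < CRITICAL_5PCT_DF8}")
```

With a sample this small, several expected counts fall below 5, so the chi-square approximation is weak; a real analysis would need far more sequences or a different test.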

Keywords: AdaBoost; XGBoost; accuracy paradox; imbalance data; machine learning; sRNA; sRNA prediction.
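
The "accuracy paradox" keyword refers to accuracy rewarding a classifier that ignores the minority class, which is why the abstract favors AUPR on imbalanced data. The sketch below is a hedged, standard-library illustration, not the authors' pipeline: the labels and scores are made up, and AUPR is estimated as average precision (one common step-wise estimator), which is an assumption about the metric's exact definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Instances with tied scores are handled as one threshold step.
    """
    n_pos = sum(y_true)
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    ap, tp, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every instance sharing the current score.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            j += 1
        recall = tp / n_pos
        precision = tp / j  # j instances are predicted positive so far
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Hypothetical imbalanced data: 5 positives (sRNA) and 95 negatives.
y_true = [1] * 5 + [0] * 95

# Trivial classifier: same score for everything, always predicts negative.
trivial_scores = [0.0] * 100
trivial_pred = [0] * 100

# Hypothetical informative model: ranks most positives above the negatives.
model_scores = [0.9, 0.8, 0.7, 0.4, 0.3] + [0.6, 0.55, 0.52, 0.5] + [0.1] * 91
model_pred = [1 if s >= 0.5 else 0 for s in model_scores]

print("trivial: accuracy", accuracy(y_true, trivial_pred),
      "AUPR", average_precision(y_true, trivial_scores))
print("model:   accuracy", accuracy(y_true, model_pred),
      "AUPR", round(average_precision(y_true, model_scores), 3))
```

In this toy setup the trivial classifier has the higher accuracy (0.95 versus 0.94), yet its AUPR equals the positive-class prevalence (0.05), while the informative model reaches roughly 0.81.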