A novel riboswitch classification based on imbalanced sequences achieved by machine learning

PLoS Comput Biol. 2020 Jul 20;16(7):e1007760. doi: 10.1371/journal.pcbi.1007760. eCollection 2020 Jul.

Abstract

Riboswitches are regulatory segments of mRNA (50-250 nt in length) with two main domains: the aptamer and the expression platform. One of the main challenges in riboswitch classification is imbalanced data, a circumstance in which the sequences recorded for one group are very few compared to the others. This leads a classifier to ignore the minority group and emphasize the majority ones, which results in skewed classification. To be in accord with recent riboswitch classification work, we considered sixteen riboswitch families that contain imbalanced sequences. The sequences were split into training and test sets using a newly developed pipeline. From the 5460 k-mers produced (k from 1 to 6), 156 features were selected using the CfsSubsetEval evaluator and the BestFirst search method in WEKA 3.8. Statistical testing showed a significant difference between the results for balanced and imbalanced sequences (p < 0.05). In addition, each algorithm showed a significant difference in sensitivity, specificity, accuracy, and macro F-score when applied to the two groups (p < 0.05). Several k-mers clustered in the heat map were found at different structural positions such as interior loops, terminal loops, and helices; they were validated to have biological functions, and some are riboswitch motifs. The analysis highlights the importance of addressing majority-class bias and overfitting. The results presented are a generalized evaluation of both the balanced and imbalanced models, which implies their ability to classify novel riboswitches. The Python source code is available at https://github.com/Seasonsling/riboswitch.
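
The figure of 5460 k-mers follows from enumerating every word over the four RNA bases for k = 1 to 6 (4 + 16 + 64 + 256 + 1024 + 4096). The sketch below illustrates that enumeration and a simple per-sequence count vector. It is not taken from the authors' repository: the function names `all_kmers` and `kmer_counts`, the use of raw overlapping counts rather than normalized frequencies, and the handling of T and ambiguous bases are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; DNA input is converted below


def all_kmers(k_min=1, k_max=6):
    """Enumerate every k-mer over ALPHABET for k_min <= k <= k_max."""
    return [
        "".join(letters)
        for k in range(k_min, k_max + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def kmer_counts(seq, k_min=1, k_max=6):
    """Count overlapping k-mers in one sequence (hypothetical helper).

    Windows containing characters outside ALPHABET (for example N)
    are skipped, which is an assumption of this sketch.
    """
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(all_kmers(k_min, k_max), 0)
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


if __name__ == "__main__":
    print(len(all_kmers()))  # size of the full k-mer feature space
    features = kmer_counts("GGACGUAACGU")
    print(features["ACGU"], features["GU"])
```

Each sequence then becomes a fixed-length vector of 5460 values, which is the kind of input a feature-selection step such as the one described above would reduce to a smaller subset.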

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Machine Learning*
  • Riboswitch / genetics*
  • Sequence Analysis, RNA / methods*
  • Software

Substances

  • Riboswitch

Grants and funding

This work was supported by the National Key Research and Development Program of China [2018YFC0310602; 2016YFA0501704]; the National Natural Science Foundation of China [31571366, 31771477]; the Chinese Government Scholarship for foreign students (MOFCOM); the Jiangsu Collaborative Innovation Center for Modern Crop Production; the Fundamental Research Funds for the Central Universities; and the Ministry of Education and Science of the Republic of North Macedonia. Opinions, results, and conclusions articulated in this paper are those of the authors and do not necessarily reflect the views of the supporting organizations. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.