Sequence similarity governs generalizability of de novo deep learning models for RNA secondary structure prediction

PLoS Comput Biol. 2023 Apr 17;19(4):e1011047. doi: 10.1371/journal.pcbi.1011047. eCollection 2023 Apr.

Abstract

Making no use of physical laws or co-evolutionary information, de novo deep learning (DL) models for RNA secondary structure prediction have achieved performance far superior to that of traditional algorithms. However, their statistical underpinning raises the crucial question of generalizability. We present a quantitative study of the performance and generalizability of a series of de novo DL models, with a minimal two-module architecture and no post-processing, under varied similarities between seen and unseen sequences. Our models demonstrate excellent expressive capacity and outperform existing methods on common benchmark datasets. However, model generalizability, i.e., the performance gap between the seen and unseen sets, degrades rapidly as sequence similarity decreases. The same trends are observed in several recent DL and machine learning models. Moreover, an inverse correlation between performance and generalizability is revealed collectively across all learning-based models with wide-ranging architectures and sizes. We further quantify how generalizability depends on sequence and structure identity scores computed via pairwise alignment, providing unique quantitative insights into the limitations of statistical learning. Generalizability thus poses a major hurdle for deploying de novo DL models in practice, and various pathways for future advances are discussed.
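The sequence identity score mentioned above is conventionally obtained by globally aligning two sequences and counting identical aligned positions. As a minimal illustration only (the paper's actual alignment tool and scoring parameters are not specified here; the match/mismatch/gap values below are illustrative assumptions), a Needleman-Wunsch alignment and identity computation can be sketched as:

```python
def nw_align(a, b, match=1, mismatch=-1, gap=-1):
    """Toy Needleman-Wunsch global alignment.

    Scoring parameters are illustrative assumptions, not the
    paper's actual settings.
    """
    n, m = len(a), len(b)
    # Fill the dynamic-programming score matrix S.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Trace back from S[n][m] to recover one optimal alignment.
    ra, rb = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append('-'); i -= 1
        else:
            ra.append('-'); rb.append(b[j - 1]); j -= 1
    return ''.join(reversed(ra)), ''.join(reversed(rb))


def identity_score(a, b):
    """Fraction of identical aligned positions over alignment length."""
    ra, rb = nw_align(a, b)
    matches = sum(1 for x, y in zip(ra, rb) if x == y and x != '-')
    return matches / len(ra)
```

For example, `identity_score("GGGAAACCC", "GGGUUUCCC")` aligns the two sequences end to end and yields 6/9, since six of the nine aligned positions are identical. The paper's central observation is that model performance on unseen sequences falls off as this score decreases between training and test sets.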

MeSH terms

  • Algorithms
  • Deep Learning*
  • Machine Learning
  • Protein Structure, Secondary
  • RNA* / genetics

Substances

  • RNA

Grants and funding

The author received no specific funding for this work.