Thousands of protein linear motif classes may still be undiscovered

Denys Bulavka; Ariel A Aptekmann; Nicolás A Méndez; Teresa Krick; Ignacio E Sánchez

doi:10.1371/journal.pone.0248841

PLoS One. 2021 May 3;16(5):e0248841. doi: 10.1371/journal.pone.0248841. eCollection 2021.

Authors

Denys Bulavka¹,², Ariel A Aptekmann¹,³, Nicolás A Méndez¹, Teresa Krick⁴, Ignacio E Sánchez¹

Affiliations

¹ Laboratorio de Fisiología de Proteínas, Facultad de Ciencias Exactas y Naturales, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina.
² Departamento de Matemática, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina.
³ Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ, United States of America.
⁴ Departamento de Matemática, Facultad de Ciencias Exactas y Naturales and IMAS-CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina.

Abstract

Linear motifs are short protein subsequences that mediate protein interactions. Hundreds of motif classes including thousands of motif instances are known. Our theory estimates how many motif classes remain undiscovered. As commonly done, we describe motif classes as regular expressions specifying motif length and the allowed amino acids at each motif position. We measure motif specificity for a pair of motif classes by quantifying how many motif-discriminating positions prevent a protein subsequence from matching the two classes at once. We derive theorems for the maximal number of motif classes that can simultaneously maintain a certain number of motif-discriminating positions between all pairs of classes in the motif universe, for a given amino acid alphabet. We also calculate the fraction of all protein subsequences that would belong to a motif class if all potential motif classes came into existence. Naturally occurring pairs of motif classes most often present a single motif-discriminating position. This mild specificity maximizes the potential number of coexisting motif classes, the expansion of the motif universe due to amino acid modifications, and the fraction of amino acid sequences that code for a motif instance. As a result, thousands of linear motif classes may remain undiscovered.
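
The idea of a motif-discriminating position can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that two motif classes have the same length and are compared position by position without any offset, and that a position is discriminating when the two sets of allowed amino acids share no residue. The helper names `parse_motif` and `discriminating_positions`, the simplified pattern syntax, and the example patterns are hypothetical and chosen only for illustration.

```python
from typing import List, Set

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_motif(pattern: str) -> List[Set[str]]:
    """Parse a simplified fixed-length motif pattern.

    Supported syntax (an assumption of this sketch): a single letter,
    '.' for any amino acid, [ABC] for a set, [^ABC] for a negated set.
    Returns one set of allowed amino acids per motif position.
    """
    positions: List[Set[str]] = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            if body.startswith("^"):
                allowed = AMINO_ACIDS - set(body[1:])
            else:
                allowed = set(body)
            i = end + 1
        elif ch == ".":
            allowed = set(AMINO_ACIDS)
            i += 1
        else:
            allowed = {ch}
            i += 1
        positions.append(allowed)
    return positions


def discriminating_positions(a: List[Set[str]], b: List[Set[str]]) -> List[int]:
    """Return the 0-based positions where no amino acid is allowed by both classes.

    Assumes both classes have the same length and are aligned without offset.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only compares motif classes of equal length")
    return [k for k, (x, y) in enumerate(zip(a, b)) if not (x & y)]


# Two illustrative patterns that differ only in their first position.
class_a = parse_motif("[ST]P.[KR]")
class_b = parse_motif("[DE]P.[KR]")
print(discriminating_positions(class_a, class_b))
```

Under these assumptions, a subsequence can match both classes at once only if the returned list is empty, so a single discriminating position is already enough to keep the two classes apart.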

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Amino Acid Motifs*
Humans
Sensitivity and Specificity
Sequence Analysis, Protein / methods*
Sequence Analysis, Protein / standards

Grants and funding

We acknowledge funding from ANPCyT, PICT 2012-2550 and ANPCyT, PICT 2015-1213 to I.E.S. and CONICET, PIP 2014 11220130100073CO to T.K. I.E.S. and T.K. are CONICET career investigators. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.