Clustering FunFams using sequence embeddings improves EC purity

Maria Littmann; Nicola Bordin; Michael Heinzinger; Konstantin Schütze; Christian Dallago; Christine Orengo; Burkhard Rost

doi:10.1093/bioinformatics/btab371

Bioinformatics. 2021 Oct 25;37(20):3449-3455. doi: 10.1093/bioinformatics/btab371.

Authors

Maria Littmann¹ ², Nicola Bordin³, Michael Heinzinger¹ ², Konstantin Schütze¹, Christian Dallago¹ ², Christine Orengo³, Burkhard Rost¹ ⁴ ⁵

Affiliations

¹ Department of Informatics, Bioinformatics & Computational Biology-i12, TUM (Technical University of Munich), 85748 Garching/Munich, Germany.
² Center for Doctoral Studies in Informatics and its Applications (CeDoSIA), TUM Graduate School, 85748 Garching/Munich, Germany.
³ Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK.
⁴ Institute for Advanced Study (TUM-IAS), 85748 Garching/Munich, Germany.
⁵ TUM School of Life Sciences Weihenstephan (WZW), 85354 Freising, Germany.

Abstract

Motivation: Classifying proteins into functional families can improve our understanding of protein function and can allow annotations to be transferred within a family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. Of all FunFams, 11% (22 830 of 203 639) contain EC annotations, and 7% of those (1526 of 22 830) have inconsistent functional annotations.

Results: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Clustering FunFams and identifying outliers with DBSCAN, applied to distances between embeddings, doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we also expect increased purity for other aspects of function. Our results can help in generating FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.
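The clustering step described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional points stand in for real embeddings, the EC labels are toy annotations, and `eps`, `min_samples`, and the helper names `dbscan` and `pure_clusters` are assumptions chosen for illustration only. DBSCAN is written out in plain Python so the example has no dependencies.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: one label per point, -1 marks an outlier."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbors = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


def pure_clusters(labels, annotations):
    """Ids of clusters whose members all carry the same annotation."""
    by_cluster = {}
    for label, annotation in zip(labels, annotations):
        if label != -1:
            by_cluster.setdefault(label, set()).add(annotation)
    return sorted(c for c, anns in by_cluster.items() if len(anns) == 1)


# Toy "family": hypothetical 2-D embeddings with hypothetical EC annotations.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # first functional group
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # second functional group
    (10.0, 0.0),                          # isolated protein
]
ec = ["1.1.1.1"] * 3 + ["2.7.11.1"] * 3 + ["3.4.21.4"]

labels = dbscan(embeddings, eps=0.5, min_samples=3)
print(labels)                     # cluster id per protein, -1 for outliers
print(pure_clusters(labels, ec))  # clusters with a single EC annotation
```

In this toy case the unsplit family mixes three annotations, whereas each cluster found by DBSCAN carries a single one and the isolated protein is flagged as an outlier. For real embeddings, a library implementation such as scikit-learn's `DBSCAN` would be the more practical choice.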

Availability and implementation: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

Supplementary information: Supplementary data are available at Bioinformatics online.

Grants and funding