Nearest neighbor search on embeddings rapidly identifies distant protein relations

Konstantin Schütze; Michael Heinzinger; Martin Steinegger; Burkhard Rost

doi:10.3389/fbinf.2022.1033775

Front Bioinform. 2022 Nov 17:2:1033775. doi: 10.3389/fbinf.2022.1033775. eCollection 2022.

Authors

Konstantin Schütze¹, Michael Heinzinger¹ ², Martin Steinegger³ ⁴, Burkhard Rost¹ ⁵

Affiliations

¹ TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Munich, Germany.
² TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany.
³ School of Biological Sciences, Seoul National University, Seoul, South Korea.
⁴ Artificial Intelligence Institute, Seoul National University, Seoul, South Korea.
⁵ Institute for Advanced Study (TUM-IAS), Germany & TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany.

Abstract

Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") have used sequences and sequence profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection at levels of <20% pairwise sequence identity, PIDE). While this approach excelled for single-domain proteins, we demonstrated the current challenges in applying it to multi-domain proteins and presented some ideas on how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and we made all our methods readily available to others.

Keywords: homology search; language models; nearest neighbor search; protein embeddings; remote homology detection.
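
The core idea named in the abstract, nearest neighbor search over protein embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one fixed-length vector per protein, uses exact brute-force cosine similarity in NumPy, and stands in random toy vectors for real pLM embeddings. The function names `l2_normalize` and `nearest_neighbors` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def nearest_neighbors(queries, database, k=3):
    """Return indices and cosine similarities of the k closest database rows.

    Exact brute-force search: every query is compared with every database
    entry. Large-scale searches would typically swap in an approximate index.
    """
    q = l2_normalize(np.asarray(queries, dtype=np.float64))
    d = l2_normalize(np.asarray(database, dtype=np.float64))
    sims = q @ d.T                           # shape: (n_queries, n_database)
    k = min(k, d.shape[0])
    idx = np.argsort(-sims, axis=1)[:, :k]   # most similar first
    scores = np.take_along_axis(sims, idx, axis=1)
    return idx, scores


# Toy data (assumption): random vectors in place of per-protein embeddings.
rng = np.random.default_rng(0)
database = rng.normal(size=(100, 64))
# Two queries built as slightly perturbed copies of database entries 5 and 42.
queries = database[[5, 42]] + 0.05 * rng.normal(size=(2, 64))

idx, scores = nearest_neighbors(queries, database, k=3)
print(idx[:, 0])  # top hit per query
```

In this toy setup the top hit for each query should be the entry it was derived from. With real embeddings, the ranked hits would be candidate related proteins to evaluate further.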