Classification of helical polymers with deep-learning language models

J Struct Biol. 2023 Dec;215(4):108041. doi: 10.1016/j.jsb.2023.108041. Epub 2023 Nov 7.

Abstract

Many macromolecules in biological systems exist in the form of helical polymers. However, the inherent polymorphism and heterogeneity of samples complicate the reconstruction of helical polymers from cryo-EM images. Currently available 2D classification methods are effective at separating particles of interest from contaminants, but they do not effectively differentiate between polymorphs, resulting in heterogeneity in the 2D classes. As such, it is crucial to develop a method that can computationally divide a dataset of polymorphic helical structures into homogeneous subsets. In this work, we utilized deep-learning language models to embed the filaments as vectors in hyperspace and group them into clusters. Tests with both simulated and experimental datasets have demonstrated that our method, HLM (Helical classification with Language Model), can effectively distinguish different types of filaments in the presence of many contaminants and low signal-to-noise ratios. We also demonstrate that HLM can isolate homogeneous subsets of particles from a publicly available dataset, resulting in the discovery of a previously unreported filament variant with an extra density around the tau filaments.
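
The embed-then-cluster idea described above can be illustrated with a toy sketch. It is not the HLM implementation: the abstract does not specify how filaments are tokenized or which language model is used. The sketch assumes, hypothetically, that each filament is a sequence of integer tokens (for example, one label per segment along the filament), replaces the learned language-model embedding with a simple normalized token-count vector, and groups the vectors with a plain k-means. The names `embed_filament`, `kmeans`, and the example data are all illustrative.

```python
from collections import Counter
import math


def embed_filament(tokens, vocab_size):
    """Map a filament (sequence of token IDs) to a unit-length count vector.

    Stand-in for a learned language-model embedding (assumption).
    """
    counts = Counter(tokens)
    vec = [float(counts.get(i, 0)) for i in range(vocab_size)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Pick the vector farthest from all centers chosen so far.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(far)
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assign each vector to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(v, centers[j]))
                  for v in vectors]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical data: two filaments dominated by tokens 0/1, two by tokens 2/3.
filaments = [
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [2, 3, 2, 3],
    [3, 2, 3, 2, 3],
]
vectors = [embed_filament(f, vocab_size=4) for f in filaments]
labels = kmeans(vectors, k=2)
```

In this toy case the first two filaments land in one cluster and the last two in another. A count vector ignores token order, which a language model would capture, so the sketch shows only the overall pipeline shape (filament to vector to cluster), not the discriminative power reported for HLM.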

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Cryoelectron Microscopy / methods
  • Cytoskeleton
  • Deep Learning*
  • Macromolecular Substances
  • Polymers*

Substances

  • Polymers
  • Macromolecular Substances