For antibody sequence generative modeling, mixture models may be all you need

Jonathan Parkinson; Wei Wang

doi:10.1093/bioinformatics/btae278

Bioinformatics. 2024 May 2;40(5):btae278. doi: 10.1093/bioinformatics/btae278.

Authors

Jonathan Parkinson¹ ², Wei Wang¹ ³

Affiliations

¹ Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, United States.
² MAP Bioscience, La Jolla, CA 92093, United States.
³ Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093-0359, United States.

Abstract

Motivation: Antibody therapeutic candidates must exhibit not only tight binding to their target but also good developability properties, especially low risk of immunogenicity.

Results: In this work, we fit a simple generative model, SAM, to sixty million human heavy and seventy million human light chains. We show that the probability of a sequence calculated by the model distinguishes human sequences from those of other species with the same or better accuracy on a variety of benchmark datasets containing >400 million sequences than any other model in the literature, outperforming large language models (LLMs) by large margins. SAM can humanize sequences, generate new sequences, and score sequences for humanness. It is both fast and fully interpretable. Our results highlight the importance of using simple models as baselines for protein engineering tasks. We additionally introduce a new tool for numbering antibody sequences which is orders of magnitude faster than existing tools in the literature.
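As an illustration of how a mixture model can assign a probability to a sequence, the sketch below scores an aligned sequence under a mixture of per-position categorical distributions. The abstract does not specify SAM's exact form, so this parameterization is an assumption, and the names `mixture_log_prob` and `peaked` and the toy parameters are hypothetical and not taken from AntPack.

```python
import math

# Assumed alphabet: 20 amino acids plus a gap symbol for aligned positions.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mixture_log_prob(seq, log_weights, log_probs):
    """Log-probability of an aligned sequence under a categorical mixture.

    log_weights[k]       : log mixing weight of cluster k
    log_probs[k][pos][a] : log-probability of letter a at position pos in cluster k
    """
    per_cluster = []
    for log_w, cluster in zip(log_weights, log_probs):
        total = log_w
        # Positions are treated as independent given the cluster.
        for pos, aa in enumerate(seq):
            total += cluster[pos][AA_INDEX[aa]]
        per_cluster.append(total)
    # Marginalize over clusters in log space.
    return log_sum_exp(per_cluster)


def peaked(letter, p=0.9):
    """Toy position distribution: mass p on one letter, the rest spread uniformly."""
    rest = (1.0 - p) / (len(ALPHABET) - 1)
    return [math.log(p if aa == letter else rest) for aa in ALPHABET]


# Hypothetical two-cluster model over three aligned positions.
log_weights = [math.log(0.5), math.log(0.5)]
log_probs = [
    [peaked("E"), peaked("V"), peaked("Q")],
    [peaked("Q"), peaked("V"), peaked("Q")],
]

typical = mixture_log_prob("EVQ", log_weights, log_probs)
atypical = mixture_log_prob("WWW", log_weights, log_probs)
```

In this toy setting, a sequence that matches a cluster's preferred letters receives a higher log-probability than one that matches none. Every term in the score can be traced to a cluster and a position, which is one sense in which such a model is interpretable.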

Availability and implementation: All tools developed in this study are available at https://github.com/Wang-lab-UCSD/AntPack.

MeSH terms

Algorithms
Antibodies* / chemistry
Computational Biology / methods
Humans
Immunoglobulin Heavy Chains / chemistry
Immunoglobulin Heavy Chains / immunology
Immunoglobulin Light Chains / chemistry
Immunoglobulin Light Chains / immunology
Sequence Analysis, Protein / methods
Software

Grants and funding

R21AI58114/NH/NIH HHS/United States