simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods

Chakravarthi Kanduri; Lonneke Scheffer; Milena Pavlović; Knut Dagestad Rand; Maria Chernigovskaya; Oz Pirvandy; Gur Yaari; Victor Greiff; Geir K Sandve

doi:10.1093/gigascience/giad074

Gigascience. 2022 Dec 28;12:giad074. doi: 10.1093/gigascience/giad074.

Authors

Chakravarthi Kanduri¹,²; Lonneke Scheffer¹; Milena Pavlović¹,²; Knut Dagestad Rand¹; Maria Chernigovskaya³; Oz Pirvandy⁴; Gur Yaari⁴; Victor Greiff³; Geir K Sandve¹,²

Affiliations

¹ Centre for Bioinformatics, Department of Informatics, University of Oslo, 0373 Oslo, Norway.
² UiORealArt Convergence Environment, University of Oslo, 0373 Oslo, Norway.
³ Department of Immunology and Oslo University Hospital, University of Oslo, 0373 Oslo, Norway.
⁴ Faculty of Engineering, Bar-Ilan University, 5290002, Israel.

Abstract

Background: Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets produce naive repertoires that lack a key feature of antigen-experienced repertoires: many shared receptor sequences (selected for common antigens).

Results: We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which may be exploited for undesired shortcut learning by certain ML methods. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used for constructing AIRR-level benchmarks based on a range of assumptions (or experimental data sources) for what constitutes receptor-level immune signals. In particular, users may choose whether or not to make prior assumptions regarding the similarity or commonality of the immune state-associated sequences that will be used as true signals. We demonstrate the realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets.
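As a toy illustration of what receptor sequence sharing means here, the sketch below contrasts repertoires whose sequences are all drawn independently with repertoires that also draw from a common pool. It is not simAIRR's algorithm or API: the function names, the uniform random amino acid sequences, and the pool and repertoire sizes are all assumptions chosen only to make the contrast visible.

```python
import random
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(rng, length=12):
    """Draw one uniformly random amino acid string (a stand-in for a receptor sequence)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def unique_sequences(rng, n):
    """Draw random sequences until n distinct ones have been collected."""
    seqs = set()
    while len(seqs) < n:
        seqs.add(random_sequence(rng))
    return seqs


def independent_repertoire(rng, size):
    """Hypothetical 'naive' repertoire: every sequence is drawn independently."""
    return unique_sequences(rng, size)


def pooled_repertoire(rng, size, public_pool, n_public):
    """Hypothetical repertoire mixing n_public sequences from a common pool with private ones."""
    repertoire = set(rng.sample(public_pool, n_public))
    while len(repertoire) < size:
        repertoire.add(random_sequence(rng))
    return repertoire


def pairwise_overlap(repertoires):
    """Number of identical sequences shared by each pair of repertoires."""
    return [len(a & b) for a, b in combinations(repertoires, 2)]


rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumed sizes: a pool of 150 'public' sequences, repertoires of 500 sequences each.
public_pool = sorted(unique_sequences(rng, 150))

independent = [independent_repertoire(rng, 500) for _ in range(5)]
pooled = [pooled_repertoire(rng, 500, public_pool, 100) for _ in range(5)]

independent_overlaps = pairwise_overlap(independent)
pooled_overlaps = pairwise_overlap(pooled)

print("independent sampling, pairwise overlaps:", independent_overlaps)
print("common-pool sampling, pairwise overlaps:", pooled_overlaps)
```

Because each pooled repertoire takes 100 of the 150 pool sequences, any two of them must share at least 50 sequences, whereas independently sampled repertoires share essentially none. A benchmark built from the latter kind of repertoire would therefore lack the sequence sharing described above.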

Conclusions: This study sheds light on the shortcut learning opportunities for ML methods that can arise with the current state-of-the-art approach to simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.

Keywords: AIRR; ML; adaptive immune receptor repertoires; benchmarking of machine learning methods; shortcut learning; simulation of AIRR data.

MeSH terms

Benchmarking*
Computer Simulation