Modeling the Amplification of Immunoglobulins through Machine Learning on Sequence-Specific Features

Matthias Döring; Christoph Kreer; Nathalie Lehnen; Florian Klein; Nico Pfeifer

doi:10.1038/s41598-019-47173-w

Sci Rep. 2019 Jul 24;9(1):10748. doi: 10.1038/s41598-019-47173-w.

Authors

Matthias Döring¹, Christoph Kreer²,³, Nathalie Lehnen²,³,⁴, Florian Klein²,³,⁴, Nico Pfeifer⁵,⁶,⁷,⁸

Affiliations

¹ Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, 66123, Saarbrücken, Germany.
² Institute of Virology, University of Cologne, Fürst-Pückler-Str. 56, 50935, Cologne, Germany.
³ Center for Molecular Medicine, University Hospital of Cologne, Robert-Koch-Straße 21, 50931, Cologne, Germany.
⁴ German Center for Infection Research, Cologne-Bonn Partner Site, Cologne, Germany.
⁵ Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, 66123, Saarbrücken, Germany. pfeifer@informatik.uni-tuebingen.de.
⁶ Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, 72076, Tübingen, Germany. pfeifer@informatik.uni-tuebingen.de.
⁷ Medical Faculty, Geissweg 5, University of Tübingen, 72076, Tübingen, Germany. pfeifer@informatik.uni-tuebingen.de.
⁸ German Center for Infection Research, Tübingen Partner Site, Tübingen, Germany. pfeifer@informatik.uni-tuebingen.de.

Abstract

Successful primer design for polymerase chain reaction (PCR) hinges on the ability to identify primers that efficiently amplify template sequences. Here, we generated a novel Taq PCR data set that reports the amplification status for pairs of primers and templates from a reference set of 47 immunoglobulin heavy chain variable sequences and 20 primers. Using logistic regression, we developed TMM, a model for predicting whether a primer amplifies a template given their nucleotide sequences. The model suggests that the free energy of annealing, ΔG, is the key driver of amplification (p = 7.35e-12) and that 3' mismatches should be considered in dependence on ΔG and the mismatch closest to the 3' terminus (p = 1.67e-05). We validated TMM by comparing its estimates with those from the thermodynamic model of DECIPHER (DE) and a model based solely on the free energy of annealing (FE). TMM outperformed the other approaches in terms of the area under the receiver operating characteristic curve (TMM: 0.953, FE: 0.941, DE: 0.896). TMM can improve primer design and is freely available via openPrimeR (http://openPrimeR.mpi-inf.mpg.de).
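The kind of comparison described in the abstract (a logistic model on ΔG plus a 3' mismatch feature versus a ΔG-only model, scored by the area under the ROC curve) can be illustrated with a minimal Python sketch. This is not the published TMM and does not use openPrimeR. The feature encoding (`sev`: 0 for no mismatch in the 3' hexamer, up to 6 for a mismatch at the terminal base), the toy values, and all function names are assumptions made for illustration. The AUC here is computed on the training data only, with no cross-validation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def predict(w, x):
    """Predicted amplification probability; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n = len(X)
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy observations, NOT data from the study:
# (ΔG of annealing in kcal/mol, 3' mismatch severity 0..6, amplified?)
toy = [
    (-14.0, 0, 1), (-12.5, 0, 1), (-11.0, 2, 1), (-10.5, 1, 1), (-9.0, 0, 1),
    (-9.5, 6, 0), (-7.0, 5, 0), (-6.0, 3, 0), (-5.5, 0, 0), (-4.0, 4, 0),
]
labels = [y for _, _, y in toy]
X_full = standardize([[dg, sev] for dg, sev, _ in toy])
X_dg = [[row[0]] for row in X_full]  # ΔG-only baseline, analogous in spirit to FE

w_full = fit_logistic(X_full, labels)
w_dg = fit_logistic(X_dg, labels)
auc_full = auc([predict(w_full, x) for x in X_full], labels)
auc_dg = auc([predict(w_dg, x) for x in X_dg], labels)
print(f"AUC (ΔG + 3' mismatch): {auc_full:.3f}, AUC (ΔG only): {auc_dg:.3f}")
```

In this toy set, one observation has a favorable ΔG but a terminal mismatch and fails to amplify, which is the kind of case a ΔG-only model cannot rank correctly. Real use would need measured amplification data and held-out evaluation.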

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

DNA Primers / genetics
DNA Primers / metabolism
Humans
Immunoglobulins / genetics
Immunoglobulins / metabolism*
Logistic Models
Machine Learning
Models, Statistical
Nucleic Acid Amplification Techniques / methods
Polymerase Chain Reaction / methods*

Substances

DNA Primers
Immunoglobulins

Associated data

figshare/10.6084/m9.figshare.6736175
figshare/10.6084/m9.figshare.6736232