Variability analysis of LC-MS experimental factors and their impact on machine learning

Tobias Greisager Rehfeldt; Konrad Krawczyk; Simon Gregersen Echers; Paolo Marcatili; Pawel Palczynski; Richard Röttger; Veit Schwämmle

doi:10.1093/gigascience/giad096

Gigascience. 2022 Dec 28;12:giad096. doi: 10.1093/gigascience/giad096.

Authors

Tobias Greisager Rehfeldt¹, Konrad Krawczyk¹, Simon Gregersen Echers², Paolo Marcatili³, Pawel Palczynski⁴, Richard Röttger¹, Veit Schwämmle⁴

Affiliations

¹ Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark.
² Department of Chemistry and Bioscience, Aalborg University, 9220 Aalborg, Denmark.
³ Department of Health Technology, Technical University of Denmark, 2800 Kongens Lyngby, Denmark.
⁴ Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark.

Abstract

Background: Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs.

Results: We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variability in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning analysis to evaluate the benefits of current best-practice methods in the field.

Conclusions: Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets that closely resemble future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it improved model performance, did not outperform a non-pretrained model.
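
The limited transferability to unseen datasets suggests evaluating models with whole projects held out, rather than with samples split at random across projects. The following is a minimal sketch of such a leave-one-project-out split. It is an illustration under stated assumptions, not the authors' pipeline: the function name `leave_one_project_out` and the project labels are hypothetical.

```python
from collections import defaultdict


def leave_one_project_out(project_ids):
    """Yield (held_out_project, train_indices, test_indices).

    Each split holds out every sample of exactly one project, so the
    test set never shares a project with the training set.
    """
    by_project = defaultdict(list)
    for idx, pid in enumerate(project_ids):
        by_project[pid].append(idx)

    for held_out in sorted(by_project):
        test_idx = list(by_project[held_out])
        train_idx = sorted(
            i
            for pid, indices in by_project.items()
            if pid != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical project labels, one per sample (for example, one per MS run).
projects = ["P1", "P1", "P2", "P2", "P2", "P3"]
splits = list(leave_one_project_out(projects))
```

Comparing scores from these splits with scores from a random split over all samples gives a rough indication of how much performance depends on having seen the test project during training.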

Keywords: bioinformatics; data mining; deep learning; machine learning; mass spectrometry; proteomics; statistics; transfer learning.

MeSH terms

Chromatography, Liquid
Machine Learning*
Tandem Mass Spectrometry*

Grants and funding

00028116/Velux Foundation