Predicting the absence of an unknown compound in a mass spectral database

Eur J Mass Spectrom (Chichester). 2019 Dec;25(6):439-444. doi: 10.1177/1469066719855503. Epub 2019 Jun 10.

Abstract

Only a small subset of the known organic compounds amenable to gas chromatography/mass spectrometry is present in even the largest mass spectral databases (such as NIST or Wiley). Nevertheless, commercially available library search algorithms cannot predict that a compound is absent from the database. In the present work, we attempted to implement such a prediction by means of supervised classification. The training and validation sets contained 1500 and 750 compounds, respectively. Two prediction sets (containing 750 and about 3000 mass spectra) were considered. The easiest-to-use models were built with only one input variable: the match factor of the best candidate or the InLib factor (both parameters were calculated within the MS Search (NIST) software). Multivariate classification models were built by partial least squares discriminant analysis (PLS-DA), with the match factors of the top n candidates used as input variables. PLS-DA was found to be the most effective approach. The prediction efficiency strongly depended on the 'uniqueness' of the mass spectra present in the test set. The PLS-DA model correctly predicted the absence of a compound from the database in 29.9% of cases for prediction set #1 and in 74.4% of cases for prediction set #2 (only 1.3% and 2.5% of the compounds actually present in the database were wrongly classified).
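As an illustration of the simplest kind of model mentioned above, a one-variable classifier on the best candidate's match factor can be sketched as a single cutoff: a spectrum whose best match factor falls below the cutoff is predicted to be absent from the library. The sketch below is not the authors' procedure. The function names, the rule for choosing the cutoff (cap the share of in-library training spectra that would be wrongly flagged as absent), the 750/750 class split, and the synthetic match factors are all assumptions made for the example; only the 0 to 999 match factor scale is taken from MS Search.

```python
import random


def fit_cutoff(best_match, in_library, max_false_absent=0.02):
    """Choose a match-factor cutoff from labelled training spectra.

    Assumed rule (not from the paper): take the largest cutoff for which
    at most `max_false_absent` of the in-library spectra score below it.
    """
    present = sorted(mf for mf, inside in zip(best_match, in_library) if inside)
    allowed = min(int(max_false_absent * len(present)), len(present) - 1)
    return present[allowed]


def rates(best_match, in_library, cutoff):
    """Return (share of absent compounds detected, share of present compounds misflagged)."""
    flagged = [mf < cutoff for mf in best_match]  # True means "predicted absent"
    n_present = sum(1 for inside in in_library if inside)
    n_absent = len(in_library) - n_present
    detected = sum(1 for f, inside in zip(flagged, in_library) if f and not inside)
    misflagged = sum(1 for f, inside in zip(flagged, in_library) if f and inside)
    return detected / n_absent, misflagged / n_present


if __name__ == "__main__":
    rng = random.Random(0)

    def clip(x):
        # Match factors in MS Search lie on a 0..999 scale.
        return max(0, min(999, round(x)))

    # Synthetic training data (assumption): in-library compounds tend to
    # score higher than compounds that are missing from the library.
    train_mf = [clip(rng.gauss(900, 40)) for _ in range(750)]
    train_mf += [clip(rng.gauss(700, 90)) for _ in range(750)]
    train_lab = [True] * 750 + [False] * 750

    cutoff = fit_cutoff(train_mf, train_lab)
    detected, misflagged = rates(train_mf, train_lab, cutoff)
    print(f"cutoff={cutoff} detected={detected:.3f} misflagged={misflagged:.3f}")
```

The two rates returned by `rates` correspond to the two kinds of figures quoted in the abstract: the share of absent compounds correctly predicted as absent, and the share of compounds actually present in the database that were wrongly classified. A multivariate model such as PLS-DA would replace the single cutoff with a discriminant built from the match factors of the top n candidates.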

Keywords: MS Search; Mass spectral library; PLS-DA; library search; partial least squares discriminant analysis.