Improving spectral library search by redefining similarity measures

J Chem Inf Model. 2015 May 26;55(5):963-71. doi: 10.1021/acs.jcim.5b00077. Epub 2015 May 5.

Abstract

Similarity plays a central role in spectral library search. The goal of spectral library search is to identify those spectra in a reference library of known materials that most closely match an unknown query spectrum, on the assumption that this will allow us to identify the main constituent(s) of the query spectrum. The similarity measures used for this task in software and the academic literature are almost exclusively metrics, meaning that the measures obey the three axioms of metrics: (1) minimality; (2) symmetry; (3) triangle inequality. Consequently, they implicitly assume that the query spectrum is drawn from the same distribution as that of the reference library. In this paper, we demonstrate that this assumption is not necessary in practical spectral library search and that in fact it is often violated in practice. Although the reference library may be constructed carefully, it is generally impossible to guarantee that all future query spectra will be drawn from the same distribution as the reference library. Before evaluating different similarity measures, we need to understand how they define the relationship between spectra. In spectral library search, we often aim to find the constituent(s) of a mixture. We propose that, rather than asking which reference library spectra are similar to the mixture, we should ask which of the reference library spectra are contained in the given query mixture. This question is inherently asymmetric. Therefore, we should adopt a nonmetric measure. To evaluate our hypothesis, we apply a nonmetric measure formulated by Tversky [Psychol. Rev. 1977, 84, 327-352] known as the Contrast Model and compare its performance to the well-known Jaccard similarity index metric on spectroscopic data sets. Our results show that the Tversky similarity measure yields better results than the Jaccard index.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Mining / methods*
  • Drug Discovery / methods*
  • Halogenation
  • Spectrum Analysis*