Automated optimization and construction of chemometric models based on highly variable raw chromatographic data

Anal Chim Acta. 2011 Jul 4;697(1-2):8-15. doi: 10.1016/j.aca.2011.04.029. Epub 2011 Apr 23.

Abstract

Direct chemometric interpretation of raw chromatographic data (as opposed to integrated peak tables) has been shown to be advantageous in many circumstances. However, this approach presents two significant challenges: data alignment and feature selection. In order to interpret the data, the time axes must be precisely aligned so that the signal from each analyte is recorded at the same coordinates in the data matrix for each and every analyzed sample. Several alignment approaches exist in the literature and they work well when the samples being aligned are reasonably similar. In cases where the background matrix for a series of samples to be modeled is highly variable, the performance of these approaches suffers. Considering the challenge of feature selection, when the raw data are used each signal at each time is viewed as an individual, independent variable; with the data rates of modern chromatographic systems, this generates hundreds of thousands of candidate variables, or tens of millions of candidate variables if multivariate detectors such as mass spectrometers are utilized. Consequently, an automated approach to identify and select appropriate variables for inclusion in a model is desirable. In this research we present an alignment approach that relies on a series of deuterated alkanes which act as retention anchors for an alignment signal, and couple this with an automated feature selection routine based on our novel cluster resolution metric for the construction of a chemometric model. The model system that we use to demonstrate these approaches is a series of simulated arson debris samples analyzed by passive headspace extraction, GC-MS, and interpreted using partial least squares discriminant analysis (PLS-DA).