Three-step hybrid strategy towards efficiently selecting variables in multivariate calibration of near-infrared spectra

Spectrochim Acta A Mol Biomol Spectrosc. 2020 Jan 5:224:117376. doi: 10.1016/j.saa.2019.117376. Epub 2019 Jul 8.

Abstract

Variable (feature or wavelength) selection is a critical step in multivariate calibration of near-infrared (NIR) spectra. High-resolution NIR spectrometers and NIR imaging instruments usually generate hundreds or thousands of wavelengths, so variable selection methods tend to suffer from a high risk of overfitting, low efficiency, or heavy computational demands. It is therefore a great challenge to efficiently select informative variables and obtain an optimal variable combination in a huge variable space. We propose a hybrid strategy for efficient variable selection based on three steps: rough selection, fine selection and optimal selection. In the first step, a highly interpretable wavelength interval selection method (interval partial least squares, iPLS) was used to roughly select informative intervals and shrink the variable space. In the second step, wavelength point selection methods such as variable importance in projection (VIP) and modified variable combination population analysis (mVCPA) were used to further shrink the variable space, from large to small, so that only the most important variables were retained. In the third step, optimization methods such as iteratively retaining informative variables (IRIV) and the genetic algorithm (GA) were applied to find an optimal variable combination among the remaining variables. The strategy makes full use of the advantages of the methods involved and compensates for their disadvantages when facing high-dimensional data. Two NIR datasets were employed to investigate the performance of the three-step hybrid strategy. Compared with other single or hybrid methods (iPLS, VIP, iPLS-VIP, iPLS-VCPA, iPLS-mVCPA, VIP-GA, VIP-IRIV, mVCPA-GA, mVCPA-IRIV), it significantly improved the prediction performance of the resulting models, indicating that the three-step hybrid strategy, including iPLS-VIP-IRIV, iPLS-VIP-GA, iPLS-mVCPA-GA and iPLS-mVCPA-IRIV, can efficiently select informative variables. Therefore, the three-step hybrid strategy is a good alternative for variable selection when dealing with high-dimensional NIR spectral data.
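The rough, fine and optimal selection steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All function names (`pls1_fit`, `cv_rmse`, `vip_scores`, `rough_selection`, `fine_selection`, `optimal_selection`) and all parameter values (10 equal-width intervals, 2 intervals kept, 20 variables kept, 3 PLS components, 5 folds) are assumptions made for illustration. The rough step scores each interval independently by cross-validated error, as a simplified form of iPLS. The fine step keeps the top-ranked variables by VIP. For the optimal step, a greedy backward elimination is swapped in as a much simpler stand-in for IRIV or GA. The data are synthetic random numbers, not NIR spectra.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 via NIPALS (assumes y is not constant). Returns the regression
    vector, the centring means, weights W, scores T and y-loadings q."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, T, q = [], [], [], []
    for _ in range(min(n_comp, X.shape[1])):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # unit-norm weight vector
        t = Xk @ w                     # scores
        p = Xk.T @ t / (t @ t)         # X-loadings
        c = (yk @ t) / (t @ t)         # y-loading
        Xk = Xk - np.outer(t, p)       # deflate X and y
        yk = yk - c * t
        W.append(w); P.append(p); T.append(t); q.append(c)
    W, P, T, q = np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, x_mean, y_mean, W, T, q

def cv_rmse(X, y, n_comp=3, k=5):
    """Root-mean-square error of k-fold cross-validation (interleaved folds)."""
    folds = np.arange(len(y)) % k
    sq_err = 0.0
    for f in range(k):
        tr, te = folds != f, folds == f
        beta, xm, ym, *_ = pls1_fit(X[tr], y[tr], n_comp)
        sq_err += np.sum((y[te] - ((X[te] - xm) @ beta + ym)) ** 2)
    return float(np.sqrt(sq_err / len(y)))

def vip_scores(X, y, n_comp=3):
    """Variable importance in projection for a PLS1 model."""
    _, _, _, W, T, q = pls1_fit(X, y, n_comp)
    ss = q ** 2 * np.sum(T ** 2, axis=0)   # y-variance explained per component
    return np.sqrt(X.shape[1] * (W ** 2 @ ss) / ss.sum())

def rough_selection(X, y, n_intervals=10, keep=2):
    """Step 1 (iPLS-like): keep the intervals with the lowest CV error."""
    blocks = np.array_split(np.arange(X.shape[1]), n_intervals)
    errors = [cv_rmse(X[:, b], y) for b in blocks]
    best = np.argsort(errors)[:keep]
    return np.sort(np.concatenate([blocks[i] for i in best]))

def fine_selection(X, y, candidates, keep=20):
    """Step 2: keep the candidates with the largest VIP scores."""
    vip = vip_scores(X[:, candidates], y)
    return np.sort(candidates[np.argsort(vip)[::-1][:keep]])

def optimal_selection(X, y, candidates):
    """Step 3 stand-in: greedy backward elimination on CV error (not IRIV/GA)."""
    current = list(candidates)
    best = cv_rmse(X[:, current], y)
    while len(current) > 1:
        trials = [(cv_rmse(X[:, [v for v in current if v != d]], y), d)
                  for d in current]
        err, drop = min(trials)
        if err >= best:                # no single removal helps: stop
            break
        best = err
        current.remove(drop)
    return current, best

# Synthetic example: 200 "wavelengths", only columns 50-54 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
y = X[:, 50:55].sum(axis=1) + 0.1 * rng.normal(size=100)

rough = rough_selection(X, y)                 # 200 -> 40 variables
fine = fine_selection(X, y, rough)            # 40 -> 20 variables
final, rmse = optimal_selection(X, y, fine)   # 20 -> small subset
print(len(rough), len(fine), sorted(int(v) for v in final), round(rmse, 3))
```

The sketch shows why the ordering matters. The expensive subset search in the last step only ever sees the 20 variables that survive the two cheaper filters, instead of all 200. Real NIR spectra are strongly collinear, unlike this synthetic matrix, so the interval count, the number of components and the retention thresholds would need tuning on actual data.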

Keywords: Hybrid strategy; Multivariate calibration; Near-infrared spectra; Variable selection; Variable space.