A new strategy to prevent over-fitting in partial least squares models based on model population analysis

Bai-Chuan Deng; Yong-Huan Yun; Yi-Zeng Liang; Dong-Sheng Cao; Qing-Song Xu; Lun-Zhao Yi; Xin Huang

doi:10.1016/j.aca.2015.04.045

Anal Chim Acta. 2015 Jun 23:880:32-41. doi: 10.1016/j.aca.2015.04.045. Epub 2015 Apr 25.

Authors

Bai-Chuan Deng¹, Yong-Huan Yun², Yi-Zeng Liang³, Dong-Sheng Cao⁴, Qing-Song Xu⁵, Lun-Zhao Yi⁶, Xin Huang²

Affiliations

¹ Department of Chemistry, University of Bergen, Bergen N-5007, Norway; School of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China.
² School of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China.
³ School of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China. Electronic address: yizeng_liang@263.net.
⁴ School of Pharmaceutical Sciences, Central South University, Changsha 410083, PR China. Electronic address: oriental-cds@163.com.
⁵ School of Mathematics and Statistics, Central South University, Changsha 410083, PR China.
⁶ Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming 650500, PR China.

PMID: 26092335
DOI: 10.1016/j.aca.2015.04.045

Abstract

Partial least squares (PLS) is one of the most widely used methods for chemical modeling. However, like many other methods with tunable parameters, it has a strong tendency to over-fit. Thus, a crucial step in PLS model building is selecting the optimal number of latent variables (nLVs). Cross-validation (CV) is the most popular method for PLS model selection because it selects a model on the basis of its prediction ability. However, CV may not yield a clear minimum of the prediction error, which makes model selection difficult. To solve this problem, we propose a new strategy for PLS model selection that combines the cross-validated coefficient of determination (Q²cv) and model stability (S). S is defined as the stability of the PLS regression vectors, which is obtained using model population analysis (MPA). The results show that, when a clear maximum of Q²cv is not obtained, S provides additional information about over-fitting and helps in finding the optimal nLVs. Compared with other regression-vector-based indicators such as the Euclidean 2-norm (B2), the Durbin-Watson statistic (DW) and the jaggedness (J), S is more sensitive to over-fitting. The model selected by our method has both good prediction ability and good stability.
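
The selection strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formula for S, so the sketch assumes a stand-in (the mean over variables of |mean| / standard deviation of the regression coefficients across sub-sampled models), assumes random sub-sampling without replacement as the sampling scheme for the model population, and uses a generic NIPALS PLS1 written with NumPy. The function names `pls1_coefficients`, `q2_cv` and `stability` are hypothetical.

```python
import numpy as np


def pls1_coefficients(X, y, n_lv):
    """Fit PLS1 by NIPALS on mean-centered data.

    Returns the regression vector plus the centering terms for prediction.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)      # weight vector
        t = Xr @ w                     # scores
        tt = t @ t
        p = Xr.T @ t / tt              # X loadings
        c = (yr @ t) / tt              # y loading
        Xr = Xr - np.outer(t, p)       # deflate X
        yr = yr - c * t                # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return b, x_mean, y_mean


def q2_cv(X, y, n_lv, n_folds=5, seed=0):
    """Cross-validated coefficient of determination: 1 - PRESS / SS_total."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        b, xm, ym = pls1_coefficients(X[train], y[train], n_lv)
        pred = (X[test] - xm) @ b + ym
        press += np.sum((y[test] - pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def stability(X, y, n_lv, n_models=200, ratio=0.8, seed=0):
    """ASSUMED stand-in for S: mean of |mean(b_j)| / std(b_j) over variables j,
    computed from regression vectors of models built on random sub-samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    size = int(ratio * n)
    B = np.array([
        pls1_coefficients(X[idx], y[idx], n_lv)[0]
        for idx in (rng.choice(n, size, replace=False) for _ in range(n_models))
    ])
    return float(np.mean(np.abs(B.mean(axis=0)) / B.std(axis=0)))


if __name__ == "__main__":
    # Synthetic data (illustrative only): 3 underlying factors, 30 variables.
    rng = np.random.default_rng(42)
    T = rng.normal(size=(40, 3))
    X = T @ rng.normal(size=(3, 30)) + 0.05 * rng.normal(size=(40, 30))
    y = T @ np.array([1.0, 0.5, -0.8]) + 0.1 * rng.normal(size=40)
    for n_lv in range(1, 7):
        print(n_lv, round(q2_cv(X, y, n_lv), 3),
              round(stability(X, y, n_lv, n_models=100), 3))
```

Scanning nLVs this way gives two curves to read together: where Q²cv is flat and shows no clear maximum, a drop in the stability value would suggest that the additional latent variables are fitting noise.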

Keywords: Cross-validation; Model population analysis; Model selection; Model stability; Over-fitting; Partial least squares.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Glycine max / chemistry
Glycine max / metabolism
Least-Squares Analysis
Models, Chemical*
Software
Spectrophotometry, Ultraviolet