Machine learning algorithms for rapid estimation of holocellulose content of poplar clones based on Raman spectroscopy

Carbohydr Polym. 2022 Sep 15:292:119635. doi: 10.1016/j.carbpol.2022.119635. Epub 2022 May 19.

Abstract

In this study, regularization algorithms (RR, LR, and ENR), classical ML algorithms (SVR, DT, and RF), and advanced GBM algorithms (LightGBM, CatBoost, and XGBoost) were applied to build predictive models of poplar holocellulose content from features extracted from Raman spectra. Model evaluation indicates that the classical ML algorithms achieve higher predictive accuracy than the regularization algorithms, and that the advanced GBM algorithms outperform the classical ML algorithms. Furthermore, models built with CatBoost and XGBoost estimate holocellulose content with high predictive accuracy, with test R² above 0.93 and test RMSE below 0.29%. To our knowledge, this is the best precision reported so far for a holocellulose content predictive model based on Raman spectroscopy. Raman spectroscopy coupled with ML algorithms is therefore a promising tool for predicting holocellulose content in poplar and can be applied in large-scale tree genetics and breeding programs.
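
The two test metrics quoted in the abstract, the coefficient of determination (R²) and the root mean squared error (RMSE), can be illustrated with a minimal sketch. It uses only the Python standard library and the textbook definitions; the function names and the sample values are hypothetical and are not taken from the study, whose actual evaluation code is not described here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y (here, % holocellulose)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only (NOT data from the study):
# reference holocellulose content (%) and hypothetical model predictions.
y_test = [78.0, 79.0, 80.0, 81.0, 82.0]
y_hat = [78.2, 78.9, 80.1, 80.8, 82.0]

test_rmse = rmse(y_test, y_hat)
test_r2 = r2_score(y_test, y_hat)
print(f"test RMSE = {test_rmse:.3f}, test R2 = {test_r2:.3f}")
```

Because RMSE is expressed in the units of the target, a value such as 0.29% is an error in percentage points of holocellulose content, while R² is unitless and compares the model's residuals with those of a constant mean predictor.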

Keywords: CatBoost; Holocellulose content; Machine learning algorithms; Raman spectroscopy; XGBoost.

MeSH terms

  • Algorithms
  • Clone Cells
  • Machine Learning
  • Populus*
  • Spectrum Analysis, Raman*