Automatic feed phase identification in multivariate bioprocess profiles by sequential binary classification

Anal Chim Acta. 2017 Aug 22:982:48-61. doi: 10.1016/j.aca.2017.05.034. Epub 2017 Jun 22.

Abstract

In this paper, we propose a new strategy for the retrospective identification of feed phases from feed profiles, enriched with online sensor data, of an Escherichia coli (E. coli) fed-batch fermentation process. In contrast to conventional (static), data-driven multi-class machine learning (ML), we exploit process knowledge to constrain our classification system, which yields more parsimonious models than static ML approaches. In particular, we enforce unidirectionality on a set of binary, multivariate classifiers trained to discriminate between adjacent feed phases by linking the classifiers through a one-way switch. The switch is activated when the output of the currently active classifier changes; the next binary classifier in the chain then takes over and discriminates between the next pair of feed phases, and so on. To prevent premature activation, we allow the switch to be activated only after a predefined number of consecutive predictions of a transition event, and we carry out a sensitivity analysis on the choice of this (time) lag parameter. From a complexity/parsimony perspective, the benefit of our approach is three-fold: i) the multi-class learning task is broken down into binary subproblems, which usually have simpler decision surfaces and tend to be less susceptible to the class-imbalance problem; ii) we exploit the fact that the process follows a rigid feed cycle structure (i.e. batch-feed-batch-feed), which allows us to focus on the subproblems involving phase transitions as they occur during the process while discarding off-transition classifiers; and iii) only one binary classifier is active at a time, which keeps the effective model complexity low. We further use a combination of logistic regression and Lasso (i.e. regularized logistic regression, RLR) as a wrapper to extract the most relevant features for the individual subproblems from the full set of high-dimensional sensor data.
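The one-way switch with a lag parameter can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `identify_phases`, the threshold-based toy classifiers, and the single "feed rate" feature are all hypothetical, and each binary classifier is assumed to return 1 when it predicts the transition to the next phase and 0 otherwise.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a binary classifier maps one feature vector to
# 0 (still in the current phase) or 1 (transition to the next phase).
BinaryClassifier = Callable[[Sequence[float]], int]


def identify_phases(
    samples: Sequence[Sequence[float]],
    classifiers: Sequence[BinaryClassifier],
    lag: int,
) -> List[int]:
    """Label each sample with a phase index using a chain of binary classifiers.

    Only classifiers[phase] is consulted at any time. The switch to the next
    phase fires after `lag` consecutive transition predictions and can never
    go backwards.
    """
    phase = 0          # index of the current phase and of the active classifier
    run = 0            # number of consecutive transition predictions so far
    labels: List[int] = []
    for x in samples:
        if phase < len(classifiers):       # no classifier is left after the last phase
            if classifiers[phase](x) == 1:
                run += 1
            else:
                run = 0                    # an isolated spike resets the counter
            if run >= lag:
                phase += 1                 # one-way switch: activate the next classifier
                run = 0
        labels.append(phase)
    return labels


# Toy data (assumption): one feature per sample, a feed-rate-like signal with
# an isolated spike at index 2 and an isolated dip at index 7.
feed = [[0.0], [0.0], [0.9], [0.0], [1.0], [1.0], [1.0], [0.1], [1.0], [0.0], [0.0], [0.0]]
chain = [
    lambda x: int(x[0] > 0.5),   # batch -> feed
    lambda x: int(x[0] < 0.5),   # feed -> batch
]

print(identify_phases(feed, chain, lag=2))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
print(identify_phases(feed, chain, lag=1))  # [0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

With `lag=1` the spike at index 2 triggers both switches prematurely, whereas `lag=2` ignores the isolated outliers at the cost of detecting each transition one sample late. Whether the labels inside the lag window are back-dated to the new phase is not stated in the abstract; this sketch does not back-date them.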
We train different soft computing classifiers, including decision trees (DT), k-nearest neighbors (k-NN), support vector machines (SVM) and a fuzzy classifier developed in-house, and compare our method with conventional multi-class ML. Our results show that the proposed method markedly outperforms static ML approaches in terms of accuracy and robustness. We achieved close to error-free feed phase classification, reducing the misclassification rate in 17 of the 20 investigated test cases by between 39% and 98.2%, depending on feature set and classifier architecture. Models trained on features selected by RLR significantly outperformed those trained on features suggested by experts, and their predictive performance was considerably less affected by the choice of the lag parameter.
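Feature selection by RLR relies on the L1 (Lasso) penalty driving the weights of uninformative features exactly to zero. The sketch below shows that mechanism on toy data using proximal gradient descent with soft-thresholding. The solver, the function names, the penalty strength `lam`, and the two toy features are assumptions made for illustration; the abstract does not say how the authors fitted their models.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of the L1 penalty: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lam: float = 0.1,    # L1 penalty strength (assumed value)
    lr: float = 0.5,     # step size (assumed value)
    n_iter: int = 500,
) -> Tuple[List[float], float]:
    """Fit L1-penalised logistic regression by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step on the log-loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the intercept is not penalised
    return w, b


def selected_features(w: Sequence[float], tol: float = 1e-8) -> List[int]:
    """Indices of features whose weight survived the L1 penalty."""
    return [j for j, wj in enumerate(w) if abs(wj) > tol]


# Toy subproblem (assumption): column 0 separates the two phases,
# column 1 is uncorrelated with the phase label.
X_toy = [[-2.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [2.0, 1.0]]
y_toy = [0, 0, 1, 1]

w_toy, b_toy = fit_l1_logistic(X_toy, y_toy)
print(selected_features(w_toy))  # [0]
```

In the scheme described above, one such selection would be run per binary subproblem, so that each transition classifier is trained on its own reduced feature set.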

Keywords: Bio-chemical reactors; Dynamic classification; Feed phase identification; Fermentation process; Fuzzy classifier; Regularized logistic regression.

MeSH terms

  • Algorithms
  • Batch Cell Culture Techniques*
  • Decision Trees
  • Escherichia coli
  • Fermentation*
  • Fuzzy Logic
  • Support Vector Machine*