Text mining of CHO bioprocess bibliome: Topic modeling and document classification

Qinghua Wang; Jonathan Olshin; K Vijay-Shanker; Cathy H Wu

doi:10.1371/journal.pone.0274042

PLoS One. 2023 Apr 6;18(4):e0274042. doi: 10.1371/journal.pone.0274042. eCollection 2023.

Authors

Qinghua Wang¹,², Jonathan Olshin¹,², K Vijay-Shanker¹, Cathy H Wu¹,²,³

Affiliations

¹ Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America.
² Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America.
³ Department of Biochemistry and Molecular & Cellular Biology, Protein Information Resource, Georgetown University Medical Center, Washington, The District of Columbia, United States of America.

Abstract

Chinese hamster ovary (CHO) cells are widely used for mass production of therapeutic proteins in the pharmaceutical industry. With the growing need to optimize the performance of producer CHO cell lines, research on CHO cell line development and bioprocessing has continued to increase in recent decades. Bibliographic mapping and classification of relevant research studies will be essential for identifying research gaps and trends in the literature. To understand the CHO literature both qualitatively and quantitatively, we conducted topic modeling on a CHO bioprocess bibliome manually compiled in 2016, and compared the topics uncovered by the Latent Dirichlet Allocation (LDA) models with the human labels of the CHO bibliome. The results show a significant overlap between the manually selected categories and the computationally generated topics, and reveal characteristics specific to the machine-generated topics. To identify relevant CHO bioprocessing papers in new scientific literature, we developed supervised models using Logistic Regression to identify specific article topics, and evaluated the results on three CHO bibliome datasets: the Bioprocessing set, the Glycosylation set, and the Phenotype set. The use of top terms as features supports the explainability of the document classification results and yields insights into new CHO bioprocessing papers.
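
The classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the six titles below are invented toy data rather than entries from the CHO bibliome, the features are plain word counts over the whole vocabulary rather than a selected set of top terms, and the logistic regression is a small gradient-descent implementation written only to show how term weights make a prediction explainable. All function names and hyperparameters are assumptions made for illustration.

```python
import math
from collections import Counter

# Invented toy titles (NOT from the CHO bibliome); label 1 = glycosylation-related.
DOCS = [
    ("glycosylation of antibody in cho cells", 1),
    ("sialylation and glycan profile of recombinant protein", 1),
    ("glycan analysis of cho antibody glycosylation", 1),
    ("fed batch culture medium optimization", 0),
    ("bioreactor scale up of cho culture", 0),
    ("medium feed strategy for batch bioreactor", 0),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def vectorize(text, vocab_index):
    """Bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vec = [0.0] * len(vocab_index)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab_index:
            vec[vocab_index[tok]] = float(count)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lr=0.5, epochs=500, l2=0.01):
    """Batch gradient descent on the L2-regularized logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(text, w, b, vocab_index):
    """Probability that a text belongs to the positive class."""
    x = vectorize(text, vocab_index)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def top_terms(w, vocab, k=3):
    """Terms with the largest positive weights, i.e. the strongest evidence for label 1."""
    ranked = sorted(zip(vocab, w), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:k]]


vocab = sorted({tok for text, _ in DOCS for tok in tokenize(text)})
vocab_index = {term: i for i, term in enumerate(vocab)}
X = [vectorize(text, vocab_index) for text, _ in DOCS]
y = [label for _, label in DOCS]
w, b = train(X, y)

print(top_terms(w, vocab))
print(predict_proba("glycosylation of recombinant antibody", w, b, vocab_index))
```

Because each feature is a term, the learned weights can be read directly: a high positive weight marks a term that pushes a document toward the target category, which is the sense in which term features support explainable classification.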

Copyright: © 2023 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
CHO Cells
Cricetinae
Cricetulus
Data Mining*
Glycosylation
Humans
Phenotype

Grants and funding

R35 GM141873/GM/NIGMS NIH HHS/United States