Large scale text mining for deriving useful insights: A case study focused on microbiome

Syed Ashif Jardary Al Ahmed; Nishad Bapatdhar; Bipin Pradeep Kumar; Samik Ghosh; Ayako Yachie; Sucheendra K Palaniappan

doi:10.3389/fphys.2022.933069

Front Physiol. 2022 Aug 31;13:933069. doi: 10.3389/fphys.2022.933069. eCollection 2022.

Authors

Syed Ashif Jardary Al Ahmed¹, Nishad Bapatdhar², Bipin Pradeep Kumar², Samik Ghosh¹˒², Ayako Yachie¹˒², Sucheendra K Palaniappan¹˒²

Affiliations

¹ SBX Corporation Inc., Tokyo, Japan.
² The NLP Group, The Systems Biology Institute, Tokyo, Japan.

Abstract

Text mining has been shown to be an auxiliary but key driver for modeling, data harmonization, and interpretation in biomedicine. Scientific literature holds a wealth of information, embodies cumulative knowledge, and remains the core basis on which mechanistic pathways, molecular databases, and models are built and refined. Text mining provides the tools needed to harness this text automatically. In this study, we show the potential of large-scale text mining for deriving novel insights, with a focus on the growing field of the microbiome. We first collected the complete set of abstracts relevant to the microbiome from PubMed and analyzed them with our text mining and intelligence platform, Taxila. We demonstrate the usefulness of text mining through two case studies. First, we analyze the geographical distribution of research and study locations in the microbiome field by extracting geographic mentions from the text. This analysis yields insights into the state of microbiome research with respect to geographical distribution and economic drivers. Next, to understand the relationships between diseases, the microbiome, and food, we construct semantic relationship networks between these concepts, which are central to the field. We show how such networks can yield useful insights with no prior knowledge encoded.

Keywords: PubMed; disease; food; hypothesis generation; microbiome; nlp; text-mining; word2vec.
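
As an illustration of the second case study, a semantic relationship network can be sketched by linking concepts from different categories whose embedding vectors are close in cosine similarity. The snippet below is a minimal sketch under stated assumptions, not the Taxila implementation: the two-dimensional vectors are invented toy values standing in for word2vec embeddings trained on abstracts, the concept names and category labels are hypothetical, and the 0.9 threshold is arbitrary. The resulting edges carry no biological meaning.

```python
import math
from itertools import combinations

# Hypothetical toy data: concept -> (category, vector).
# In practice the vectors would come from a word2vec model trained on
# the abstract corpus; these values are made up for illustration only.
EMBEDDINGS = {
    "obesity": ("disease", [1.0, 0.0]),
    "colitis": ("disease", [0.0, 1.0]),
    "firmicutes": ("microbe", [1.0, 0.1]),
    "lactobacillus": ("microbe", [0.1, 1.0]),
    "yogurt": ("food", [0.2, 1.0]),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_network(embeddings, threshold=0.9):
    """Return (concept_a, concept_b, similarity) edges between concepts of
    different categories whose cosine similarity meets the threshold."""
    edges = []
    # Sort so that each edge is reported with its names in alphabetical order.
    for (a, (cat_a, vec_a)), (b, (cat_b, vec_b)) in combinations(
        sorted(embeddings.items()), 2
    ):
        if cat_a == cat_b:
            continue  # only cross-category links (e.g. disease to food)
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            edges.append((a, b, round(sim, 3)))
    # Strongest relationships first.
    return sorted(edges, key=lambda edge: -edge[2])


if __name__ == "__main__":
    for a, b, sim in build_network(EMBEDDINGS):
        print(f"{a} -- {b}  (cosine {sim})")
```

Restricting edges to cross-category pairs keeps the network focused on disease, microbe, and food relationships; raising or lowering the threshold trades precision for coverage.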