Domain-specific Topic Model for Knowledge Discovery in Computational and Data-Intensive Scientific Communities

Yuanxun Zhang; Prasad Calyam; Trupti Joshi; Satish Nair; Dong Xu

doi:10.1109/tkde.2021.3093350

IEEE Trans Knowl Data Eng. 2023 Feb;35(2):1402-1420. doi: 10.1109/tkde.2021.3093350. Epub 2021 Jul 1.

Authors

Yuanxun Zhang¹, Prasad Calyam¹, Trupti Joshi¹, Satish Nair¹, Dong Xu¹

Affiliation

¹ Department of Electrical Engineering and Computer Science, University of Missouri-Columbia, Columbia, MO, 65211.

Abstract

Shortening the time to knowledge discovery and adapting prior domain knowledge are challenges for computational and data-intensive communities such as bioinformatics and neuroscience. A domain scientist must obtain guidance by querying massive amounts of information from diverse text corpora covering a wide-ranging set of topics when investigating new methods, developing new tools, or integrating datasets. In this paper, we propose a novel "domain-specific topic model" (DSTM) to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplary scientific domains. Our DSTM is a generative model that extends the Latent Dirichlet Allocation (LDA) model and uses the Markov chain Monte Carlo (MCMC) algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from the bioinformatics and neuroscience domains that include more than 25,000 papers published over the last ten years, featuring hundreds of tools and datasets that are commonly used in relevant studies. Evaluation experiments based on generalization and information retrieval metrics show that our model performs better than state-of-the-art baseline models at discovering highly specific latent topics within a domain. Lastly, we demonstrate applications that benefit from our DSTM to discover intra-domain, cross-domain and trend knowledge patterns.
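
To make the LDA-plus-MCMC idea in the abstract concrete, the following is a minimal sketch of collapsed Gibbs sampling (one common MCMC scheme) for plain LDA. It is not the DSTM itself, which the abstract describes only as an extension of LDA. The function name `lda_gibbs`, its parameters, and the hyperparameter defaults are all assumptions made for illustration, and documents are assumed to be lists of integer word ids.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative, not the DSTM).

    docs: list of documents, each a list of integer word ids in
    [0, vocab_size). Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n[d][k]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n[k][w]
    topic_total = [0] * num_topics                              # n[k]
    assign = []                                                 # z[d][i]

    # Initialise every token with a uniformly random topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word
```

Normalising a row of `topic_word` (after adding `beta`) gives an estimate of that topic's word distribution, and normalising a row of `doc_topic` (after adding `alpha`) gives a document's topic mixture. A model in the spirit of the DSTM would presumably add further latent variables, for example for tools and datasets, but that structure is not specified in this abstract and is not sketched here.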

Keywords: Latent Dirichlet Allocation; Multi-disciplinary Knowledge Discovery; Theoretical Model for Big Data; Topic Model.

Grants and funding

R01 MH122023/MH/NIMH NIH HHS/United States