Optimizations for Computing Relatedness in Biomedical Heterogeneous Information Networks: SemNet 2.0

Anna Kirkpatrick; Chidozie Onyeze; David Kartchner; Stephen Allegri; Davi Nakajima An; Kevin McCoy; Evie Davalbhakta; Cassie S Mitchell

doi:10.3390/bdcc6010027

Big Data Cogn Comput. 2022 Mar;6(1):27. doi: 10.3390/bdcc6010027. Epub 2022 Mar 1.

Authors

Anna Kirkpatrick¹,²; Chidozie Onyeze¹,²; David Kartchner¹,³; Stephen Allegri¹,⁴; Davi Nakajima An¹,³; Kevin McCoy¹,⁴; Evie Davalbhakta¹; Cassie S Mitchell¹,⁴,⁵

Affiliations

¹ Laboratory for Pathology Dynamics, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA.
² School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332, USA.
³ School of Computer Science, Georgia Institute of Technology, Atlanta, GA 30332, USA.
⁴ Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA.
⁵ Machine Learning Center at Georgia Tech, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Abstract

Literature-based discovery (LBD) summarizes information and generates insight from large text corpora. The SemNet framework uses a large heterogeneous information network, or "knowledge graph," of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j, improving knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is comprehensive open-source software that provides a significantly faster, more effective, and more user-friendly means of automated biomedical LBD. An example case ranks relationships between Alzheimer's disease and metabolic co-morbidities.

Keywords: Alzheimer’s disease; HeteSim; SemNet; ULARA; biomedical knowledge graph; machine learning; natural language processing; rank aggregation; relatedness; text mining.
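
For readers unfamiliar with the HeteSim metric named in the abstract, the following is a minimal sketch of the standard exact (non-randomized) computation for an even-length metapath. It is not SemNet 2.0's implementation and does not reproduce the paper's randomized algorithms. The function names, the dense NumPy matrices, and the toy drug-gene-disease graph are all assumptions made for illustration.

```python
import numpy as np

def row_normalize(m):
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    m = np.asarray(m, dtype=float)
    sums = m.sum(axis=1, keepdims=True)
    # Rows with no outgoing edges stay all-zero instead of dividing by zero.
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)

def hetesim(left, right, s, t):
    """Normalized HeteSim between source s and target t (hypothetical helper).

    left:  adjacency matrices from the source type to the middle type, in path order
    right: adjacency matrices from the middle type to the target type, in path order
    """
    left = [np.asarray(a, dtype=float) for a in left]
    right = [np.asarray(a, dtype=float) for a in right]

    # Walk forward from s along the first half of the metapath.
    p = np.zeros(left[0].shape[0])
    p[s] = 1.0
    for a in left:
        p = p @ row_normalize(a)

    # Walk backward from t along the second half of the metapath.
    q = np.zeros(right[-1].shape[1])
    q[t] = 1.0
    for a in reversed(right):
        q = q @ row_normalize(a.T)

    # HeteSim is the cosine of the two distributions over the middle node type.
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    return float(p @ q / denom) if denom > 0 else 0.0

# Toy graph (made up): 2 drugs, 3 genes, 2 diseases; metapath drug-gene-disease.
drug_gene = np.array([[1, 1, 0],
                      [0, 0, 1]])
gene_disease = np.array([[1, 0],
                         [1, 0],
                         [0, 1]])

same_genes = hetesim([drug_gene], [gene_disease], s=0, t=0)
no_overlap = hetesim([drug_gene], [gene_disease], s=0, t=1)
```

In this toy graph, drug 0 and disease 0 reach exactly the same genes, so `same_genes` is approximately 1, while drug 0 and disease 1 share no genes, so `no_overlap` is 0. Odd-length metapaths need an extra step that splits the middle relation, which this sketch omits.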
