Identification of most influential co-occurring gene suites for gastrointestinal cancer using biomedical literature mining and graph-based influence maximization

Charles C N Wang; Jennifer Jin; Jan-Gowth Chang; Masahiro Hayakawa; Atsushi Kitazawa; Jeffrey J P Tsai; Phillip C-Y Sheu

doi:10.1186/s12911-020-01227-6

BMC Med Inform Decis Mak. 2020 Sep 3;20(1):208. doi: 10.1186/s12911-020-01227-6.

Authors

Charles C N Wang¹ ², Jennifer Jin³, Jan-Gowth Chang⁴ ⁵ ⁶, Masahiro Hayakawa⁷, Atsushi Kitazawa⁷, Jeffrey J P Tsai¹, Phillip C-Y Sheu⁸

Affiliations

¹ Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan.
² Center for Artificial Intelligence in Precision Medicine, Asia University, Taichung, Taiwan.
³ Department of EECS and BME, University of California, Irvine, USA.
⁴ Department of Laboratory Medicine, China Medical University Hospital, Taichung, Taiwan.
⁵ Center for Precision Medicine, China Medical University Hospital, Taichung, Taiwan.
⁶ Graduate Institute of Clinical Medical Science, School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan.
⁷ NEC Solution Innovators, Koto-ku, Tokyo, Japan.
⁸ Department of EECS and BME, University of California, Irvine, USA. psheu@uci.edu.

Abstract

Background: Gastrointestinal (GI) cancers, including colorectal, gastric, and pancreatic cancer, are among the most frequently diagnosed malignancies each year and represent a major public health problem worldwide.

Methods: This paper reports an aided curation pipeline for identifying potentially influential genes in gastrointestinal cancer. The pipeline mines biomedical literature, identifying named entities with a Bi-LSTM-CNN-CRF method. The entities and their associations are used to construct a graph, from which we compute the sets of co-occurring genes that are most influential according to an influence maximization algorithm.
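The graph construction and influence maximization steps described in the Methods can be illustrated with a minimal sketch. The abstract does not state which diffusion model or selection algorithm the authors used, so this example assumes an independent cascade model with a uniform activation probability and the classic greedy seed selection. All function names, parameters, and the toy labels A to F are hypothetical and are not taken from the paper.

```python
import itertools
import random
from collections import defaultdict


def build_cooccurrence_graph(docs):
    """Map each gene to {neighbor: number of documents mentioning both}."""
    graph = defaultdict(lambda: defaultdict(int))
    for genes in docs:
        # Count each unordered gene pair once per document.
        for a, b in itertools.combinations(sorted(set(genes)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def simulate_cascade(graph, seeds, prob, rng):
    """Run one independent-cascade simulation; return how many genes activate."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph[node]:
                # Assumption: every edge fires with the same probability.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return len(active)


def greedy_influence_max(graph, k, prob=0.1, runs=200, seed=0):
    """Greedily pick up to k seeds, each maximizing the estimated spread."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for candidate in sorted(graph):
            if candidate in seeds:
                continue
            # Monte Carlo estimate of the expected spread with this candidate.
            total = sum(
                simulate_cascade(graph, seeds + [candidate], prob, rng)
                for _ in range(runs)
            )
            spread = total / runs
            if spread > best_spread:
                best, best_spread = candidate, spread
        if best is None:  # fewer than k nodes in the graph
            break
        seeds.append(best)
    return seeds


# Toy input: each inner list stands for the genes recognized in one abstract.
toy_docs = [["A", "B", "C"], ["A", "B"], ["C", "D"], ["E", "F"]]
toy_graph = build_cooccurrence_graph(toy_docs)
toy_seeds = greedy_influence_max(toy_graph, k=2, prob=0.3)
```

In this sketch the co-occurrence counts are stored but not used to weight the activation probability. Scaling `prob` by the edge count would be one plausible refinement, though whether the paper does so is not stated in the abstract.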

Results: The most influential sets of co-occurring genes that we discovered include RARA - CRBP1, CASP3 - BCL2, BCL2 - CASP3 - CRBP1, RARA - CASP3 - CRBP1, FOXJ1 - RASSF3 - ESR1, FOXJ1 - RASSF1A - ESR1, and FOXJ1 - RASSF1A - TNFAIP8 - ESR1. Using TCGA data together with functional and pathway enrichment analysis, we show that the proposed approach works well in the context of gastrointestinal cancer.

Conclusions: Our pipeline uses text mining to identify objects and relationships, constructs a graph from them, and applies graph-based influence maximization to discover the most influential co-occurring genes. It presents a viable direction for assisting knowledge discovery in clinical applications.

Keywords: Bi-LSTM-CNN-CRF; Co-occurrence network; Gastrointestinal cancer; Influence maximization; Text mining.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Apoptosis Regulatory Proteins
Data Mining*
Gastrointestinal Neoplasms* / genetics
Genes, Neoplasm*
Humans

Substances

Apoptosis Regulatory Proteins