Graph-Aware Language Model Pre-Training on a Large Graph Corpus Can Help Multiple Graph Applications

Han Xie; Vassilis N Ioannidis; Carl Yang; Da Zheng; Xiang Song; Yi Xu; Jun Ma; Qing Ping; Belinda Zeng; Houyu Zhang; Sheng Wang; Trishul Chilimbi

doi:10.1145/3580305.3599833

KDD. 2023 Aug:2023:5270-5281. doi: 10.1145/3580305.3599833. Epub 2023 Aug 4.

Authors

Han Xie¹, Vassilis N Ioannidis², Carl Yang¹, Da Zheng³, Xiang Song³, Yi Xu⁴, Jun Ma⁵, Qing Ping⁶, Belinda Zeng⁴, Houyu Zhang⁴, Sheng Wang⁷, Trishul Chilimbi⁴

Affiliations

¹ Emory University, Atlanta, GA, USA.
² Amazon Search AI, Santa Clara, CA, USA.
³ Amazon AWS AI, Santa Clara, CA, USA.
⁴ Amazon Search AI, Seattle, WA, USA.
⁵ Walgreens AI Lab, Bellevue, WA, USA.
⁶ Amazon Search AI, Palo Alto, CA, USA.
⁷ Amazon Scholar, Seattle, WA, USA.

Abstract

Pre-training models on large text corpora has proven effective for a variety of downstream applications in NLP. In graph mining, an analogous approach is to pre-train graph models on large graphs in the hope of benefiting downstream graph applications, and several recent studies have explored it. However, no existing study has investigated pre-training text-plus-graph models on large heterogeneous graphs with abundant textual information (i.e., large graph corpora) and then fine-tuning them on different related downstream applications with different graph schemas. To address this problem, we propose a framework for graph-aware language model pre-training (GaLM) on a large graph corpus, which incorporates large language models and graph neural networks, together with a variety of fine-tuning methods for downstream applications. We conduct extensive experiments on Amazon's real internal datasets and on large public datasets. Comprehensive empirical results and in-depth analysis demonstrate the effectiveness of the proposed methods, and we report the lessons learned.

Keywords: Graph Neural Network; Heterogeneous Graph; Large Language Model; Pre-Training and Fine-Tuning.
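
As a rough illustration of what a "graph-aware" text representation can mean, the toy sketch below combines a per-node text embedding with one round of neighbor averaging over a small heterogeneous graph. It is not the GaLM implementation, and the abstract does not specify how the two components are combined. Everything here is an assumption made for illustration: `embed_text` is a hashed bag-of-words stand-in for a language model, `aggregate` is a simplified stand-in for a graph neural network layer, and the query/product node names and texts are hypothetical.

```python
from collections import defaultdict


def embed_text(text, dim=8):
    """Toy stand-in for a language model: hashed bag-of-words vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Deterministic bucket from character codes (avoids Python's salted hash()).
        bucket = sum(ord(c) for c in token) % dim
        vec[bucket] += 1.0
    return vec


def aggregate(node_vecs, edges):
    """One round of mean-neighbor aggregation, a simplified GNN-style layer."""
    neighbors = defaultdict(list)
    for src, dst in edges:
        # Treat edges as undirected for this sketch.
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    out = {}
    for node, vec in node_vecs.items():
        nbr_vecs = [node_vecs[n] for n in neighbors[node]]
        if not nbr_vecs:
            # Isolated nodes keep their text-only vector.
            out[node] = list(vec)
            continue
        mean = [sum(col) / len(nbr_vecs) for col in zip(*nbr_vecs)]
        # Average the node's own text vector with its neighborhood mean.
        out[node] = [(a + b) / 2.0 for a, b in zip(vec, mean)]
    return out


# Hypothetical two-type graph: a query node linked to a product node.
texts = {
    "q1": "red running shoes",
    "p1": "lightweight red running shoes",
    "p2": "stainless steel kettle",
}
edges = [("q1", "p1")]

node_vecs = {node: embed_text(t) for node, t in texts.items()}
graph_aware = aggregate(node_vecs, edges)
```

In this sketch the vector for `q1` shifts toward its linked product `p1`, while the unconnected node `p2` keeps its text-only vector. A real system would replace both stand-ins with trained models and learn them on a pre-training objective.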

Grants and funding

K25 DK135913/DK/NIDDK NIH HHS/United States