UGDAS: Unsupervised graph-network based denoiser for abstractive summarization in biomedical domain

Yongping Du; Yiliang Zhao; Jingya Yan; Qingxiao Li

doi:10.1016/j.ymeth.2022.03.012

Methods. 2022 Jul:203:160-166. doi: 10.1016/j.ymeth.2022.03.012. Epub 2022 Apr 2.

Authors

Yongping Du¹, Yiliang Zhao², Jingya Yan³, Qingxiao Li⁴

Affiliations

¹ Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. Electronic address: ypdu@bjut.edu.cn.
² Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. Electronic address: ylzhao7@yeah.net.
³ Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. Electronic address: yanjy1998@163.com.
⁴ Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. Electronic address: qxli_bjut@163.com.

PMID: 35378296
DOI: 10.1016/j.ymeth.2022.03.012

Abstract

Abstractive summarization models generate summaries auto-regressively, but their quality is often degraded by noise in the source text. Learning cross-sentence relations is a crucial step in this task, and graph-based networks are more effective at capturing relationships between sentences. Moreover, domain knowledge is important for distinguishing noise in text from a specialized domain. This paper proposes a novel model structure called UGDAS, which combines a sentence-level denoiser based on an unsupervised graph-network with an auto-regressive generator. It uses domain knowledge and sentence position information to denoise the original text and thereby improve the quality of the generated summaries. We evaluate the model on the text summarization task using the recently introduced CORD-19 (COVID-19 Open Research Dataset), which contains large-scale data on coronaviruses. The experimental results show that our model achieves state-of-the-art (SOTA) results on the CORD-19 dataset and outperforms the related baseline models on the PubMed Abstract dataset.
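
The abstract gives no implementation details, so the following is only a minimal sketch of what sentence-level, graph-based denoising could look like. It is not the UGDAS model. Everything in it is an assumption made for illustration: the function names (`tokenize`, `jaccard`, `denoise`), the use of Jaccard word overlap as edge weights, degree centrality as the sentence score, and the additive position and domain-term bonuses with their weights. The auto-regressive generator is not shown; in a pipeline of this shape, the kept sentences would be passed to it as input.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9-]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def denoise(sentences, domain_terms, keep=3,
            position_weight=0.5, domain_weight=0.5):
    """Keep the `keep` highest-scoring sentences, in original order.

    Hypothetical scoring (an assumption, not the paper's method):
    graph degree centrality + position bonus + domain-term bonus.
    """
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    scores = []
    for i in range(n):
        # Degree centrality: sum of edge weights to every other sentence.
        centrality = sum(
            jaccard(tokens[i], tokens[j]) for j in range(n) if j != i
        )
        # Earlier sentences receive a larger bonus.
        position_bonus = position_weight * (1.0 - i / n)
        # Each domain term found in the sentence adds a fixed bonus.
        domain_bonus = domain_weight * len(tokens[i] & domain_terms)
        scores.append(centrality + position_bonus + domain_bonus)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]
```

No labels are needed for a scoring scheme of this kind, which is the sense in which it is unsupervised: a sentence that shares little vocabulary with the rest of the document and contains no domain terms receives a low score and is filtered out before generation.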

Keywords: Abstractive summarization; Domain knowledge; Graph-network; Pre-trained language model.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

COVID-19*
Concept Formation
Humans
Semantics*