Graph-based biomedical text summarization: An itemset mining and sentence clustering approach

Mozhgan Nasr Azadani; Nasser Ghadiri; Ensieh Davoodijam

doi:10.1016/j.jbi.2018.06.005

J Biomed Inform. 2018 Aug:84:42-58. doi: 10.1016/j.jbi.2018.06.005. Epub 2018 Jun 15.

Authors

Mozhgan Nasr Azadani¹, Nasser Ghadiri², Ensieh Davoodijam³

Affiliations

¹ Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran. Electronic address: mozhgan.nasr@ec.iut.ac.ir.
² Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran. Electronic address: nghadiri@cc.iut.ac.ir.
³ Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran. Electronic address: e.davoodijam@ec.iut.ac.ir.

PMID: 29906584
DOI: 10.1016/j.jbi.2018.06.005

Abstract

Objective: Automatic text summarization offers an efficient way to access the ever-growing body of scientific and clinical literature in the biomedical domain by condensing source documents while retaining their most informative content. In this paper, we propose a novel graph-based summarization method that takes advantage of domain-specific knowledge and a well-established data mining technique called frequent itemset mining.

Methods: Our summarizer exploits the Unified Medical Language System (UMLS) to construct a concept-based model of the source document by mapping the document's text to UMLS concepts. It then discovers frequent itemsets to take the correlations among multiple concepts into account. The method uses these correlations to define a similarity function, on the basis of which a graph representation of the document is constructed. The summarizer then employs a minimum spanning tree based clustering algorithm to discover the various subthemes of the document. Finally, it generates the summary by selecting the most informative and relevant sentences from all subthemes within the text.
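
The steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption: sentences are taken as already mapped to sets of concept labels (the labels below are hypothetical placeholders, not real UMLS identifiers), frequent itemsets are found by brute-force counting rather than a dedicated mining algorithm, similarity is a Jaccard score over the frequent itemsets each sentence covers, and clusters are obtained by removing the heaviest edges of a minimum spanning tree. The abstract does not specify any of these details.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force stand-in for Apriori/FP-growth: count itemsets up to max_size."""
    counts = {}
    for concepts in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(concepts), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, c in counts.items() if c >= min_support}

def covered(concepts, itemsets):
    """Frequent itemsets fully contained in one sentence's concept set."""
    return {s for s in itemsets if set(s) <= concepts}

def jaccard(a, b):
    """Assumed similarity: overlap of covered itemsets (not the paper's function)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def mst_clusters(n, sim, k):
    """Kruskal MST on distance 1 - sim, then keep only the n - k lightest MST edges."""
    edges = sorted((1.0 - sim[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    mst = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((d, i, j))

    # Dropping the k - 1 heaviest tree edges leaves exactly k components.
    parent = list(range(n))
    for _, i, j in sorted(mst)[: n - k]:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def summarize(sentence_concepts, min_support=2, k=2):
    """Return indices of one representative sentence per cluster (subtheme)."""
    itemsets = frequent_itemsets(sentence_concepts, min_support)
    cover = [covered(c, itemsets) for c in sentence_concepts]
    n = len(cover)
    sim = [[jaccard(cover[i], cover[j]) for j in range(n)] for i in range(n)]
    clusters = mst_clusters(n, sim, k)
    # Assumed informativeness score: number of frequent itemsets covered.
    return sorted(max(c, key=lambda i: len(cover[i])) for c in clusters)

# Hypothetical concept sets, one per sentence.
doc = [
    {"insulin", "glucose"},
    {"insulin", "glucose", "obesity"},
    {"statin", "cholesterol"},
    {"statin", "cholesterol", "stroke"},
    {"insulin", "glucose"},
]
print(summarize(doc))  # -> [0, 2]
```

In this toy input the sentences fall into two subthemes, and one sentence index is returned for each. A real system would need a concept mapper for UMLS and a principled way to choose the support threshold and the number of clusters.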

Results: We perform an automatic evaluation over a large number of summaries using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. The results demonstrate that the proposed summarization system outperforms various baselines and benchmark approaches.
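
As a rough illustration of what a ROUGE score measures, the sketch below computes ROUGE-N recall: the fraction of the reference summary's n-grams that also appear in the candidate summary, with counts clipped. The abstract does not state which ROUGE variants were reported, and the lowercasing and whitespace tokenization here are simplifying assumptions; this is not the official ROUGE toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the drug reduces blood pressure"
candidate = "the drug lowers blood pressure"
print(rouge_n_recall(candidate, reference, n=1))  # -> 0.8
print(rouge_n_recall(candidate, reference, n=2))  # -> 0.5
```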

Conclusion: This research suggests that incorporating domain-specific knowledge and frequent itemset mining better equips the summarization system to measure the informativeness of sentences. Moreover, clustering the graph nodes (sentences) enables the summarizer to efficiently target the different main subthemes of a source document. The evaluation results show that the proposed approach can significantly improve the performance of summarization systems in the biomedical domain.

Keywords: Biomedical literature summarization; Frequent itemset mining; Graph clustering; Minimum spanning tree based clustering; Similarity measure.

MeSH terms

Algorithms
Cluster Analysis*
Data Mining / methods*
Electronic Health Records
Medical Informatics / methods*
Pattern Recognition, Automated
Semantics*
Unified Medical Language System