Graph-based biomedical text summarization: An itemset mining and sentence clustering approach

J Biomed Inform. 2018 Aug:84:42-58. doi: 10.1016/j.jbi.2018.06.005. Epub 2018 Jun 15.

Abstract

Objective: Automatic text summarization offers an efficient way to access the ever-growing volume of scientific and clinical literature in the biomedical domain by condensing source documents while preserving their most informative content. In this paper, we propose a novel graph-based summarization method that takes advantage of domain-specific knowledge and a well-established data mining technique called frequent itemset mining.

Methods: Our summarizer uses the Unified Medical Language System (UMLS) to map the source document to biomedical concepts and build a concept-based model of it. It then discovers frequent itemsets to account for correlations among multiple concepts. These correlations underpin a similarity function, from which a graph representation of the document is constructed. The summarizer then applies a minimum spanning tree based clustering algorithm to discover the subthemes of the document. Finally, it generates the summary by selecting the most informative and relevant sentences from all subthemes of the text.
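The steps named in the Methods can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not specify the similarity function, the clustering details, or the sentence scoring, so the sketch assumes (1) sentences already mapped to sets of hypothetical concept identifiers, (2) a brute-force frequent itemset miner, (3) Jaccard overlap of the frequent itemsets two sentences cover as the similarity, (4) cutting the heaviest edges of a Kruskal minimum spanning tree to form clusters, and (5) picking the sentence that covers the most frequent itemsets in each cluster.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force miner: itemsets found in at least min_support sentences."""
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                found[itemset] = support
    return found

def covered(sentence, itemsets):
    """Frequent itemsets fully contained in one sentence's concept set."""
    return {i for i in itemsets if i <= sentence}

def similarity(a, b, itemsets):
    """Assumed similarity: Jaccard overlap of the covered frequent itemsets."""
    ca, cb = covered(a, itemsets), covered(b, itemsets)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0

def mst_clusters(n, distance, k):
    """Build an MST with Kruskal, then keep only its n-k lightest edges,
    which leaves k connected components (the clusters)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted((distance(i, j), i, j) for i, j in combinations(range(n), 2))
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))  # appended in ascending weight order

    parent[:] = range(n)  # reset, then re-link using the lightest edges only
    for _, i, j in mst[: max(n - k, 0)]:
        parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Toy input: each sentence as a set of hypothetical concept identifiers.
sentences = [
    {"C1", "C2", "C3"},
    {"C1", "C2"},
    {"C4", "C5"},
    {"C4", "C5", "C6"},
]

itemsets = frequent_itemsets(sentences, min_support=2)
dist = lambda i, j: 1.0 - similarity(sentences[i], sentences[j], itemsets)
clusters = mst_clusters(len(sentences), dist, k=2)

# One sentence per cluster: the one covering the most frequent itemsets
# (ties go to the earliest sentence).
summary = [max(c, key=lambda s: len(covered(sentences[s], itemsets))) for c in clusters]
print(clusters, summary)
```

On this toy input the two sentence pairs share no frequent itemsets across pairs, so the single heaviest tree edge separates them into two subthemes. A real system would replace the brute-force miner with an Apriori or FP-growth implementation and would need a principled way to choose the number of clusters.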

Results: We perform an automatic evaluation over a large number of summaries using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. The results demonstrate that the proposed summarization system outperforms various baselines and benchmark approaches.
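For readers unfamiliar with ROUGE, the recall-oriented n-gram variant can be sketched in a few lines. This is a simplified illustration under stated assumptions (whitespace tokenization, lowercasing, a single reference summary, no stemming or stopword handling); the abstract does not say which ROUGE variants or settings were used, and the function name and example sentences are hypothetical.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Share of the reference's n-grams that also occur in the candidate,
    with counts clipped so a repeated n-gram is not over-credited."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total if total else 0.0

reference = "the drug reduces blood pressure"
candidate = "the drug lowers blood pressure"
rouge1 = rouge_n_recall(candidate, reference, n=1)  # 4 of 5 unigrams match
rouge2 = rouge_n_recall(candidate, reference, n=2)  # 2 of 4 bigrams match
print(rouge1, rouge2)
```

Published evaluations normally rely on an established ROUGE toolkit rather than a hand-rolled function, since tokenization and stemming choices change the scores.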

Conclusion: This research suggests that incorporating domain-specific knowledge and frequent itemset mining better equips the summarization system to measure the informativeness of sentences. Moreover, clustering the graph nodes (sentences) can enable the summarizer to target the main subthemes of a source document efficiently. The evaluation results show that the proposed approach can significantly improve the performance of summarization systems in the biomedical domain.

Keywords: Biomedical literature summarization; Frequent itemset mining; Graph clustering; Minimum spanning tree based clustering; Similarity measure.

MeSH terms

  • Algorithms
  • Cluster Analysis*
  • Data Mining / methods*
  • Electronic Health Records
  • Medical Informatics / methods*
  • Pattern Recognition, Automated
  • Semantics*
  • Unified Medical Language System