Hybrid Ensemble-Rule Algorithm for Improved MEDLINE® Sentence Boundary Detection

AMIA Annu Symp Proc. 2022 Feb 21;2021:677-686. eCollection 2021.

Abstract

Sentence boundary detection (SBD) is a fundamental building block in the Natural Language Processing (NLP) pipeline. Incorrect SBD can degrade the performance of subsequent processing stages. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient to detect sentence boundaries. However, a corpus such as MEDLINE citations presents challenges for SBD because of several syntactic ambiguities, e.g., periods in abbreviations and capital letters in the first words of sentences. In this manuscript, we present an algorithm that addresses these challenges through majority voting among three SBD engines (Python NLTK, pySBD, and Syntok), followed by custom post-processing algorithms that rely on spaCy part-of-speech tagging, abbreviation and capital-letter detection, and general sentence statistics. Experiments on several thousand MEDLINE citations show that the proposed combination of multiple SBD engines and post-processing rules performs better than each individual engine.
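The majority-voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three splitters below (naive_period, capital_aware, abbrev_aware) are hypothetical toy stand-ins for NLTK, pySBD, and Syntok, and the voting scheme (keep a boundary when at least two engines agree on its character offset) is an assumption about how such a vote might be set up. The post-processing rules are not covered.

```python
import re
from collections import Counter
from typing import Callable, List, Set

Splitter = Callable[[str], List[str]]


def naive_period(text: str) -> List[str]:
    # Toy engine (hypothetical): split after every period followed by whitespace.
    return re.split(r"(?<=\.)\s+", text)


def capital_aware(text: str) -> List[str]:
    # Toy engine (hypothetical): split only when the next token starts uppercase.
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)


def abbrev_aware(text: str) -> List[str]:
    # Toy engine (hypothetical): like naive_period, but never after "e.g." or "i.e.".
    return re.split(r"(?<!e\.g\.)(?<!i\.e\.)(?<=\.)\s+", text)


def boundary_offsets(text: str, sentences: List[str]) -> Set[int]:
    """Map an engine's sentences back to end-of-sentence character offsets."""
    offsets: Set[int] = set()
    cursor = 0
    for sent in sentences:
        sent = sent.strip()
        if not sent:
            continue
        start = text.find(sent, cursor)
        if start < 0:
            # The engine changed the text; this sketch simply skips the sentence.
            continue
        cursor = start + len(sent)
        offsets.add(cursor)
    return offsets


def majority_vote_split(text: str, engines: List[Splitter], min_votes: int = 2) -> List[str]:
    """Keep only the boundaries proposed by at least min_votes engines."""
    votes: Counter = Counter()
    for engine in engines:
        votes.update(boundary_offsets(text, engine(text)))
    cuts = sorted(offset for offset, n in votes.items() if n >= min_votes)
    if not cuts or cuts[-1] != len(text):
        cuts.append(len(text))  # always close the final sentence
    result, start = [], 0
    for end in cuts:
        piece = text[start:end].strip()
        if piece:
            result.append(piece)
        start = end
    return result


text = ("Patients received aspirin, e.g. 100 mg daily. "
        "Outcomes improved. p53 levels fell.")
result = majority_vote_split(text, [naive_period, capital_aware, abbrev_aware])
```

On this toy input, only naive_period votes for a break after "e.g.", so that boundary is rejected, while the break before the lowercase-initial "p53" is kept because two of the three toy engines agree on it. The result is three sentences. With real engines, the same offset-based vote could be fed by wrapping each library's output in a function that returns a list of sentence strings.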

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Algorithms
  • Humans
  • Language*
  • MEDLINE
  • Natural Language Processing*