OGER++: hybrid multi-type entity recognition

Lenz Furrer; Anna Jancso; Nicola Colic; Fabio Rinaldi

doi:10.1186/s13321-018-0326-3

J Cheminform. 2019 Jan 21;11(1):7. doi: 10.1186/s13321-018-0326-3.

Authors

Lenz Furrer¹, Anna Jancso¹, Nicola Colic¹, Fabio Rinaldi²,³

Affiliations

¹ Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland.
² Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland. fabio.rinaldi@uzh.ch.
³ Fondazione Bruno Kessler, Via Sommarive, 18, 38123, Trento, Italy. fabio.rinaldi@uzh.ch.

Abstract

Background: We present a text-mining tool for recognizing biomedical entities in scientific literature. OGER++ is a hybrid system for named entity recognition and concept recognition (linking), which combines a dictionary-based annotator with a corpus-based disambiguation component. The annotator uses an efficient look-up strategy combined with a normalization method for matching spelling variants. The disambiguation classifier is implemented as a feed-forward neural network which acts as a postfilter to the previous step.
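The two-stage idea described above (dictionary look-up over normalized tokens, followed by a filtering step) can be illustrated with a minimal sketch. This is not the OGER++ implementation: the normalization rule (lowercasing and splitting at letter/digit boundaries), the function names, the concept identifiers, and the scorer passed to `postfilter` are all assumptions made for illustration. In particular, the scorer is only a stand-in for the feed-forward network mentioned in the abstract.

```python
import re

# Assumed normalization: lowercase, split into runs of letters or digits,
# so that "IL-2", "IL 2" and "Il2" all map to the same key.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+")

def normalize(term):
    """Return a tuple of lowercased letter/digit runs for a term."""
    return tuple(m.group().lower() for m in TOKEN.finditer(term))

def build_index(entries):
    """Map normalized term -> list of (concept_id, entity_type)."""
    index = {}
    for term, concept_id, entity_type in entries:
        index.setdefault(normalize(term), []).append((concept_id, entity_type))
    return index

def annotate(text, index, max_len=5):
    """Look up every token n-gram (up to max_len tokens) in the index."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in TOKEN.finditer(text)]
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            key = tuple(tok for tok, _, _ in tokens[i:j])
            for concept_id, entity_type in index.get(key, ()):
                hits.append((tokens[i][1], tokens[j - 1][2],
                             concept_id, entity_type))
    return hits

def postfilter(hits, score, threshold=0.5):
    """Keep hits whose score reaches the threshold.

    `score` is a hypothetical callable standing in for a trained
    classifier; any function mapping a hit to a float will do here.
    """
    return [hit for hit in hits if score(hit) >= threshold]

# Hypothetical dictionary entries; the identifiers are made up.
entries = [
    ("IL-2", "ID:0001", "protein"),
    ("interleukin 2", "ID:0001", "protein"),
]
index = build_index(entries)
text = "Il2 and IL 2 both bind."
hits = annotate(text, index)
spans = [text[start:end] for start, end, _, _ in hits]
```

Here `spans` collects the surface forms that matched the dictionary entry despite differing in case, hyphenation, and spacing. A real postfilter would score each candidate from its context rather than with a fixed rule.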

Results: We evaluated the system in terms of processing speed and annotation quality. In the speed benchmarks, the OGER++ web service processes 9.7 abstracts or 0.9 full-text documents per second. On the CRAFT corpus, we achieved 71.4% and 56.7% F1 for named entity recognition and concept recognition, respectively.

Conclusions: Combining knowledge-based and data-driven components allows creating a system with competitive performance in biomedical text mining.

Keywords: Concept recognition; Machine learning; Named entity recognition; Natural language processing.

Grants and funding

CR30I1 162758/Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung