Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

Riza Batista-Navarro; Rafal Rak; Sophia Ananiadou

doi:10.1186/1758-2946-7-S1-S6

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S6. doi: 10.1186/1758-2946-7-S1-S6. eCollection 2015.

Authors

Riza Batista-Navarro¹, Rafal Rak², Sophia Ananiadou²

Affiliations

¹ National Centre for Text Mining, Manchester Institute of Biotechnology, 131 Princess St, Manchester, M1 7DN, UK; Department of Computer Science, University of the Philippines Diliman, Quezon City, 1101, Philippines.
² National Centre for Text Mining, Manchester Institute of Biotechnology, 131 Princess St, Manchester, M1 7DN, UK.

Abstract

Background: The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. In order to optimise its performance, we introduced customisations in various aspects of our solution. These include the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules.

Results: Our evaluation shows that optimal performance is obtained when our customisations are integrated into the chemical entity recogniser. When its performance is compared with that of state-of-the-art methods under comparable experimental settings, our solution achieves a competitive advantage. We also show that our recogniser, which uses a model trained on the CHEMDNER corpus, is suitable for recognising names in a wide range of corpora, consistently outperforming two popular chemical NER tools.

Conclusion: The contributions resulting from this work are two-fold. Firstly, we present the details of a chemical entity recognition methodology that has demonstrated performance at a level competitive with, if not superior to, that of state-of-the-art methods. Secondly, the developed suite of solutions has been made publicly available as a configurable workflow in the interoperable text mining workbench Argo. This allows interested users to conveniently apply and evaluate our solutions in the context of other chemical text mining tasks.

Keywords: Chemical named entity recognition; Conditional random fields; Configurable workflows; Feature engineering; Sequence labelling; Text mining; Workflow optimisation.
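
Illustrative sketch: The abstract names three customisation points around the conditional random fields model: pre-processing, chemistry knowledge-rich features, and post-processing rules. The Python sketch below shows roughly what each stage might look like for a single sentence. It is not the authors' implementation. The tokenisation pattern, the suffix list, the toy dictionary and the parenthesis rule are all assumptions made for illustration, and no model is trained here; the feature dictionaries are simply the kind of input a CRF toolkit typically accepts.

```python
import re

# Hypothetical resources for illustration only; not taken from the paper.
CHEMICAL_SUFFIXES = ("ol", "ine", "ate", "ide", "one", "ane")
TOY_DICTIONARY = {"aspirin", "ethanol", "glucose"}

# Assumed tokenisation rule: runs of letters, runs of digits, single symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def tokenise(text):
    """Pre-processing: split letters, digits and symbols into separate tokens."""
    return TOKEN_PATTERN.findall(text)


def word_shape(token):
    """Collapse a token to a coarse shape, e.g. 'NaCl' becomes 'AaAa'."""
    shape = re.sub(r"[A-Z]+", "A", token)
    shape = re.sub(r"[a-z]+", "a", shape)
    return re.sub(r"\d+", "0", shape)


def token_features(tokens, i):
    """Build a feature dictionary for token i (one item of a CRF input sequence)."""
    token = tokens[i]
    lower = token.lower()
    return {
        "lower": lower,
        "shape": word_shape(token),
        # Assumed knowledge-rich cues: chemical-looking suffix, dictionary match.
        "chem_suffix": lower.endswith(CHEMICAL_SUFFIXES) and len(lower) > 4,
        "in_dictionary": lower in TOY_DICTIONARY,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


def balance_parentheses(text, start, end):
    """Post-processing heuristic (assumed): if a predicted span has an unmatched
    '(' and the next character is ')', extend the span by one character."""
    span = text[start:end]
    if span.count("(") > span.count(")") and text[end:end + 1] == ")":
        end += 1
    return start, end


if __name__ == "__main__":
    sentence = "Ethanol (EtOH) and 2-propanol were added."
    tokens = tokenise(sentence)
    for i, tok in enumerate(tokens):
        print(tok, token_features(tokens, i))
```

In a full system, the per-token dictionaries would be paired with BIO-style labels from an annotated corpus to train a sequence labeller, and rules such as `balance_parentheses` would be applied to the predicted spans afterwards.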