D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information

Bioinformatics. 2018 Oct 15;34(20):3539-3546. doi: 10.1093/bioinformatics/bty356.

Abstract

Motivation: Recognition of biomedical named entities in the textual literature is a highly challenging research topic with great interest, playing as the prerequisite for extracting huge amount of high-valued biomedical knowledge deposited in unstructured text and transforming them into well-structured formats. Long Short-Term Memory (LSTM) networks have recently been employed in various biomedical named entity recognition (NER) models with great success. They, however, often did not take advantages of all useful linguistic information and still have many aspects to be further improved for better performance.

Results: We propose D3NER, a novel biomedical named entity recognition (NER) model using conditional random fields and bidirectional long short-term memory improved with fine-tuned embeddings of various linguistic information. D3NER is thoroughly compared with seven very recent state-of-the-art NER models, of which two are even joint models with named entity normalization (NEN), which was proven to bring performance improvements to NER. Experimental results on benchmark datasets, i.e. the BioCreative V Chemical Disease Relation (BC5 CDR), the NCBI Disease and the FSU-PRGE gene/protein corpus, demonstrate the out-performance and stability of D3NER over all compared models for chemical, gene/protein NER and over all models (without NEN jointed, as D3NER) for disease NER, in almost all cases. On the BC5 CDR corpus, D3NER achieves F1 of 93.14 and 84.68% for the chemical and disease NER, respectively; while on the NCBI Disease corpus, its F1 for the disease NER is 84.41%. Its F1 for the gene/protein NER on FSU-PRGE is 87.62%.

Availability and implementation: Data and source code are available at: https://github.com/aidantee/D3NER.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Benchmarking
  • Humans
  • Linguistics*
  • Molecular Sequence Annotation*
  • Proteins / analysis
  • Proteins / genetics
  • Software*

Substances

  • Proteins