Rule-based human gene normalization in biomedical text with confidence estimation

Comput Syst Bioinformatics Conf. 2007:6:371-9.

Abstract

The ability to identify gene mentions in text and normalize them to the proper unique identifiers is crucial for downstream text mining applications in bioinformatics. We have developed a rule-based algorithm that divides the normalization task into two steps. The first step combines pattern matching for gene symbols with an approximate term searching technique for gene names. In the second step, the algorithm measures several features based on morphological, statistical, and contextual information to estimate the level of confidence that the correct identifier has been selected for a potential mention. Uniqueness, inverse distance, and coverage are three novel features we quantified. The algorithm was evaluated against the BioCreAtIvE datasets. The feature weights were tuned by the Nelder-Mead simplex method. An F-score of 0.7622 and an AUC (area under the recall-precision curve) of 0.7461 were achieved on the test data using the set of weights optimized on the training data.
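To make the second step concrete, the following is a minimal sketch of how per-feature values might be combined into a confidence score and how an F-score could be computed at a given confidence threshold. The abstract does not state how the features are combined, so the weighted linear sum, the scaling of feature values to [0, 1], the function names, and the data layout are all assumptions made for illustration. The weight tuning itself (Nelder-Mead in the paper) is not shown.

```python
from typing import Dict, List, Set, Tuple

# Feature names follow the abstract; their exact definitions are not given
# there, so values here are assumed to be pre-computed and scaled to [0, 1].
FEATURES = ("uniqueness", "inverse_distance", "coverage")

# One candidate normalization: (document id, gene identifier, feature values).
Candidate = Tuple[str, str, Dict[str, float]]


def confidence(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear combination of feature values (an assumed form)."""
    return sum(weights[name] * features.get(name, 0.0) for name in FEATURES)


def f_score(
    candidates: List[Candidate],
    weights: Dict[str, float],
    gold: Set[Tuple[str, str]],
    threshold: float,
) -> float:
    """F-score of the (document, identifier) pairs kept at this threshold."""
    predicted = {
        (doc_id, gene_id)
        for doc_id, gene_id, feats in candidates
        if confidence(feats, weights) >= threshold
    }
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely to exercise the functions.
weights = {"uniqueness": 0.5, "inverse_distance": 0.25, "coverage": 0.25}
candidates = [
    ("d1", "G1", {"uniqueness": 1.0, "inverse_distance": 0.5, "coverage": 1.0}),
    ("d1", "G2", {"uniqueness": 0.2, "inverse_distance": 0.1, "coverage": 0.3}),
    ("d2", "G3", {"uniqueness": 0.9, "inverse_distance": 0.8, "coverage": 0.6}),
]
gold = {("d1", "G1"), ("d2", "G3"), ("d2", "G4")}

print(f_score(candidates, weights, gold, threshold=0.5))
```

Under these assumptions, a function such as `f_score` could serve as the objective that a derivative-free optimizer maximizes over the weights, and sweeping `threshold` would trace out a recall-precision curve.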

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Abstracting and Indexing / methods*
  • Artificial Intelligence*
  • Computer Graphics
  • Confidence Intervals
  • Data Interpretation, Statistical
  • Database Management Systems*
  • Documentation / methods
  • Genes*
  • Information Storage and Retrieval / methods*
  • Internet
  • Natural Language Processing*
  • PubMed*
  • User-Computer Interface