Rule-based human gene normalization in biomedical text with confidence estimation

William W Lau; Calvin A Johnson; Kevin G Becker

Comput Syst Bioinformatics Conf. 2007:6:371-9.

Authors

William W Lau¹, Calvin A Johnson, Kevin G Becker

Affiliation

¹ Center for Information Technology, National Institutes of Health, Bethesda, MD 20892-5624, USA.

PMID: 17951839

Abstract

The ability to identify gene mentions in text and normalize them to the proper unique identifiers is crucial for downstream text mining applications in bioinformatics. We have developed a rule-based algorithm that divides the normalization task into two steps. The first step combines pattern matching for gene symbols with an approximate term searching technique for gene names. In the second step, the algorithm measures several features based on morphological, statistical, and contextual information to estimate the level of confidence that the correct identifier has been selected for a potential mention. Uniqueness, inverse distance, and coverage are three novel features we quantified. The algorithm was evaluated against the BioCreAtIvE datasets. The feature weights were tuned by the Nelder-Mead simplex method. An F-score of 0.7622 and an AUC (area under the recall-precision curve) of 0.7461 were achieved on the test data using the set of weights optimized on the training data.
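The second step can be illustrated with a minimal sketch. It assumes, hypothetically, that the confidence for a candidate identifier is a normalized weighted sum of feature values in [0, 1], and that a candidate is accepted when its confidence reaches a threshold; the abstract does not state the exact combination rule. The feature names come from the abstract, but the class `Candidate`, the functions `confidence`, `select`, and `f_score`, the weights, the threshold, and all data values are invented for illustration. Weight tuning (the abstract names the Nelder-Mead simplex method) is not shown; an optimizer would repeatedly call something like `f_score` as its objective.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A potential gene mention mapped to an identifier (hypothetical structure)."""
    doc_id: str
    gene_id: str
    features: dict  # feature name -> value assumed to lie in [0, 1]


def confidence(candidate, weights):
    """Normalized weighted sum of features (an assumed combination rule)."""
    total_weight = sum(weights.values())
    score = sum(w * candidate.features.get(name, 0.0) for name, w in weights.items())
    return score / total_weight


def select(candidates, weights, threshold):
    """Keep (document, identifier) pairs whose confidence reaches the threshold."""
    return {(c.doc_id, c.gene_id) for c in candidates
            if confidence(c, weights) >= threshold}


def f_score(predicted, gold):
    """Balanced F-score: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative weights and data only; none of these values come from the paper.
weights = {"uniqueness": 0.5, "inverse_distance": 0.3, "coverage": 0.2}
candidates = [
    Candidate("doc1", "GENE:A", {"uniqueness": 0.9, "inverse_distance": 0.8, "coverage": 1.0}),
    Candidate("doc1", "GENE:B", {"uniqueness": 0.2, "inverse_distance": 0.1, "coverage": 0.5}),
    Candidate("doc2", "GENE:C", {"uniqueness": 0.7, "inverse_distance": 0.6, "coverage": 0.6}),
]
gold = {("doc1", "GENE:A"), ("doc2", "GENE:C"), ("doc2", "GENE:D")}

predicted = select(candidates, weights, threshold=0.5)
print(sorted(predicted))
print(round(f_score(predicted, gold), 4))
```

Raising the threshold in this sketch trades recall for precision, which is the trade-off that a recall-precision curve summarizes.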

Publication types

Research Support, N.I.H., Intramural

MeSH terms

Abstracting and Indexing / methods*
Artificial Intelligence*
Computer Graphics
Confidence Intervals
Data Interpretation, Statistical
Database Management Systems*
Documentation / methods
Genes*
Information Storage and Retrieval / methods*
Internet
Natural Language Processing*
PubMed*
User-Computer Interface

Grants and funding

Intramural NIH HHS/United States