nala: text mining natural language mutation mentions

Juan Miguel Cejuela; Aleksandar Bojchevski; Carsten Uhlig; Rustem Bekmukhametov; Sanjeev Kumar Karn; Shpend Mahmuti; Ashish Baghudana; Ankit Dubey; Venkata P Satagopam; Burkhard Rost

doi:10.1093/bioinformatics/btx083

Bioinformatics. 2017 Jun 15;33(12):1852-1858. doi: 10.1093/bioinformatics/btx083.

Authors

Juan Miguel Cejuela¹,², Aleksandar Bojchevski¹,², Carsten Uhlig¹, Rustem Bekmukhametov¹,³, Sanjeev Kumar Karn¹,⁴, Shpend Mahmuti¹, Ashish Baghudana¹,⁵, Ankit Dubey¹,⁶, Venkata P Satagopam⁷, Burkhard Rost¹,⁸

Affiliations

¹ TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Munich, Germany.
² TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany.
³ Microsoft, Bellevue, WA, USA.
⁴ Ludwig Maximilian University, 80538 Munich & Siemens AG, Corporate Technology, Munich, Germany.
⁵ BITS-Pilani K. K. Birla Goa Campus, Goa, India.
⁶ Concur (Germany) GmbH, Frankfurt am Main, Germany.
⁷ Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Belvaux, Luxembourg.
⁸ Institute of Advanced Study (TUM-IAS) & Institute for Food and Plant Sciences WZW - Weihenstephan & New York Consortium on Membrane Protein Structure (NYCOMPS) & Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA.

Abstract

Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. 'E6V'), leaving relevant mentions in natural language (NL) largely untapped (e.g. 'glutamic acid was substituted by valine at residue 6').
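The gap between ST and NL mentions can be illustrated with a minimal sketch. The pattern below is an assumption made for illustration only (it is not the pattern used by SETH, tmVar, or nala); it covers just single-letter substitutions of the form wild-type residue, position, mutant residue.

```python
import re

# Single-letter codes of the 20 standard amino acids.
AA = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative pattern for ST substitutions such as 'E6V':
# wild-type residue, position, mutant residue.
ST_PATTERN = re.compile(rf"\b([{AA}])(\d+)([{AA}])\b")


def find_st_mentions(text):
    """Return (wild_type, position, mutant) tuples for ST-style mentions."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in ST_PATTERN.finditer(text)
    ]


st_text = "The E6V variant was studied."
nl_text = "glutamic acid was substituted by valine at residue 6"

print(find_st_mentions(st_text))  # the ST mention is found
print(find_st_mentions(nl_text))  # the NL mention yields no match
```

A surface pattern of this kind finds 'E6V' but returns nothing for the NL sentence, even though both describe the same substitution.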

Results: We introduced three new corpora, which suggest that named-entity recognition (NER) is more challenging than anticipated: 28-77% of all articles contained mentions only available in NL. Our new method nala captured both NL and ST mentions by combining conditional random fields with word embedding features learned without supervision from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of the mentions. For NL mentions, the corresponding value rose to 100%: all of them were detected by nala only.
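The method comparison can be read as set arithmetic: pool the mentions correctly detected by any method, then ask which fraction was found by one method alone. The sketch below assumes that reading; the function name and the toy mention labels are hypothetical and are not data from the paper.

```python
def unique_share(hits_by_method, method):
    """Fraction of all correctly detected mentions found only by `method`."""
    # Pool of mentions detected by at least one method.
    all_hits = set().union(*hits_by_method.values())
    # Mentions detected by any method other than the one of interest.
    others = set().union(
        *(hits for name, hits in hits_by_method.items() if name != method)
    )
    only = hits_by_method[method] - others
    return len(only) / len(all_hits) if all_hits else 0.0


# Toy labels, invented purely for illustration.
toy_hits = {
    "SETH": {"m1", "m2"},
    "tmVar": {"m2", "m3"},
    "nala": {"m1", "m2", "m3", "m4"},
}

for name in toy_hits:
    print(name, unique_share(toy_hits, name))
```

In this toy setting the first two methods contribute nothing unique, while the third alone accounts for one of the four pooled mentions.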

Availability and implementation: Source code, API and corpora are freely available at: http://tagtog.net/-corpora/IDP4+

Contact: nala@rostlab.org.

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Data Mining / methods*
Humans
Mutation*
Natural Language Processing*
PubMed
Software*
Unsupervised Machine Learning