NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Kanix Wang; Robert Stevens; Halima Alachram; Yu Li; Larisa Soldatova; Ross King; Sophia Ananiadou; Annika M Schoene; Maolin Li; Fenia Christopoulou; José Luis Ambite; Joel Matthew; Sahil Garg; Ulf Hermjakob; Daniel Marcu; Emily Sheng; Tim Beißbarth; Edgar Wingender; Aram Galstyan; Xin Gao; Brendan Chambers; Weidi Pan; Bohdan B Khomtchouk; James A Evans; Andrey Rzhetsky

doi:10.1038/s41540-021-00200-x

NPJ Syst Biol Appl. 2021 Oct 20;7(1):38. doi: 10.1038/s41540-021-00200-x.

Authors

Kanix Wang¹,², Robert Stevens³, Halima Alachram⁴, Yu Li⁵, Larisa Soldatova⁶, Ross King⁷,⁸,⁹, Sophia Ananiadou³,¹⁰, Annika M Schoene³,¹⁰, Maolin Li³,¹⁰, Fenia Christopoulou³,¹⁰, José Luis Ambite¹¹, Joel Matthew¹¹, Sahil Garg¹¹, Ulf Hermjakob¹¹, Daniel Marcu¹¹, Emily Sheng¹¹, Tim Beißbarth⁴, Edgar Wingender¹², Aram Galstyan¹¹, Xin Gao⁵, Brendan Chambers¹³, Weidi Pan¹⁴, Bohdan B Khomtchouk¹⁵,¹⁶, James A Evans¹⁷, Andrey Rzhetsky¹⁸,¹⁹,²⁰,²¹

Affiliations

¹ The Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, 60637, US.
² The Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL, 60637, US.
³ Department of Computer Science, University of Manchester, M13 9PL, Manchester, UK.
⁴ Institute of Medical Bioinformatics, University of Göttingen, Goldschmidtstrasse 1, 37077, Göttingen, Germany.
⁵ Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia.
⁶ Goldsmiths, University of London, 8 Lewisham Way, New Cross, London, SE14 6NW, UK.
⁷ Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Dr, Cambridge, CB3 0AS, United Kingdom.
⁸ Alan Turing Institute, 96 Euston Rd, Somers Town, London, NW1 2DB, United Kingdom.
⁹ Department of Biology and Biological Engineering, Chalmers University of Technology, SE-412 96, Göteborg, Sweden.
¹⁰ National Centre for Text Mining, University of Manchester, M1 7DN, Manchester, UK.
¹¹ The Information Sciences Institute, University of Southern California, Marina del Rey, CA, 90089, US.
¹² geneXplain GmbH, Am Exer 19b, 38302, Wolfenbüttel, Germany.
¹³ Knowledge Lab, Department of Sociology, University of Chicago, Chicago, IL, 60637, US.
¹⁴ Master of Science in Statistics Program, University of Chicago, Chicago, IL, 60637, US.
¹⁵ The Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL, 60637, US. bohdan@uchicago.edu.
¹⁶ Department of Medicine, University of Chicago, Chicago, IL, 60637, US. bohdan@uchicago.edu.
¹⁷ Knowledge Lab, Department of Sociology, University of Chicago, Chicago, IL, 60637, US. jevans@uchicago.edu.
¹⁸ The Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.
¹⁹ The Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.
²⁰ Department of Medicine, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.
²¹ Department of Human Genetics, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.

Abstract

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades¹,², the most dramatic advances in MR have followed in the wake of critical corpus development³. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet⁴ was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
