Approaches to verb subcategorization for biomedicine

J Biomed Inform. 2013 Apr;46(2):212-27. doi: 10.1016/j.jbi.2012.12.001. Epub 2012 Dec 28.

Abstract

Information about verb subcategorization frames (SCFs) is important to many tasks in natural language processing (NLP) and, in turn, text mining. Biomedicine has a need for high-quality SCF lexicons to support the extraction of information from the biomedical literature, which helps biologists to take advantage of the latest biomedical knowledge despite the overwhelming growth of that literature. Unfortunately, techniques for creating such resources for biomedical text are relatively undeveloped compared with those for general language. This paper serves as an introduction to subcategorization and existing approaches to acquisition, and provides motivation for developing techniques that address issues particularly important to biomedical NLP. First, we give the traditional linguistic definition of subcategorization, along with several related concepts. Second, we describe approaches to learning SCF lexicons from large data sets for general and biomedical domains. Third, we consider the crucial issue of linguistic variation between biomedical fields (subdomain variation). We demonstrate significant variation among subdomains and find that this variation does not simply follow patterns of general lexical variation. Finally, we note several requirements for future research in biomedical SCF lexicon acquisition: a high-quality gold standard, investigation of different definitions of subcategorization, and minimally supervised methods that can learn subdomain-specific lexical usage without the need for extensive manual work.

Publication types

  • Review

MeSH terms

  • Abstracting and Indexing
  • Animals
  • Biological Science Disciplines / classification*
  • Biomedical Research / classification*
  • Cluster Analysis
  • Computational Biology
  • Data Mining*
  • Humans
  • Natural Language Processing*