Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models

Database (Oxford). 2022 Nov 18;2022:baac098. doi: 10.1093/database/baac098.

Abstract

The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report the results achieved. We define the task as a relation classification problem, which we approach with pretrained transformer language models. On top of this basic architecture, we experiment with using textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions, including an extensive hyperparameter search covering 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information does not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and ranks first out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.
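The authors' actual implementation lives in the repository linked above. Purely as an illustration of the general setup the abstract describes (candidate relation classification with pretrained transformers, combined by ensembling), here is a minimal sketch. The model checkpoints, the label set, the entity-marker scheme and the soft-voting combination are all assumptions for illustration, not details taken from the paper, and the classification heads below would need to be fine-tuned on DrugProt before the predictions mean anything.

```python
# Minimal sketch: relation classification over a candidate chemical-gene
# pair, ensembled across several pretrained transformers by averaging
# softmax outputs (soft voting). All names below are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical label subset; DrugProt defines 13 relation types plus a
# "no relation" class.
LABELS = ["NO_RELATION", "INHIBITOR", "ACTIVATOR", "SUBSTRATE"]

# Assumed checkpoints; the paper's best system ensembles 10 models, and the
# heads loaded here are randomly initialized until fine-tuned on DrugProt.
MODEL_NAMES = [
    "dmis-lab/biobert-base-cased-v1.1",
    "allenai/scibert_scivocab_cased",
]


def classify(sentence: str, models) -> str:
    """Average the class probabilities of all ensemble members and
    return the label with the highest mean probability."""
    probs = None
    for tokenizer, model in models:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        p = torch.softmax(logits, dim=-1)
        probs = p if probs is None else probs + p
    return LABELS[int(torch.argmax(probs))]


models = []
for name in MODEL_NAMES:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=len(LABELS)
    )
    model.eval()
    models.append((tokenizer, model))

# The candidate entity pair is wrapped in marker strings, a common scheme
# for presenting a relation candidate to a sequence classifier; the exact
# markers used by the paper are not given in the abstract.
sentence = "[CHEM] aspirin [/CHEM] irreversibly inhibits [GENE] COX-1 [/GENE]."
print(classify(sentence, models))
```

Averaging probabilities (soft voting) is one common way to combine such an ensemble; majority voting over hard predictions is an equally plausible alternative, and the abstract does not specify which variant the authors used.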

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Databases, Factual
  • Language*
  • Proteins*
  • Toxicogenetics

Substances

  • Proteins