Classification and analysis of a large collection of in vivo bioassay descriptions

Magdalena Zwierzyna; John P Overington

doi:10.1371/journal.pcbi.1005641

PLoS Comput Biol. 2017 Jul 5;13(7):e1005641. doi: 10.1371/journal.pcbi.1005641. eCollection 2017 Jul.

Authors

Magdalena Zwierzyna¹,², John P Overington¹,²

Affiliations

¹ BenevolentAI, London, United Kingdom.
² Institute of Cardiovascular Science, University College London, London, United Kingdom.

Abstract

Testing potential drug treatments in animal disease models is a decisive step of all preclinical drug discovery programs. Yet, despite the importance of such experiments for translational medicine, there have been relatively few efforts to comprehensively and consistently analyze the data produced by in vivo bioassays. This is partly due to their complexity and the lack of accepted reporting standards: publicly available animal screening data are only accessible in unstructured free-text format, which hinders computational analysis. In this study, we use text mining to extract information from the descriptions of over 100,000 drug screening-related assays in rats and mice. We retrieve our dataset from ChEMBL, an open-source literature-based database focused on preclinical drug discovery. We show that in vivo assay descriptions can be effectively mined for relevant information, including experimental factors that might influence the outcome and reproducibility of animal research: genetic strains, experimental treatments, and phenotypic readouts used in the experiments. We further systematize the extracted information using an unsupervised language model (Word2Vec), which learns semantic similarities between terms and phrases, allowing identification of related animal models and classification of entire assay descriptions. In addition, we show that random forest models trained on features generated by Word2Vec can predict the class of drugs tested in different in vivo assays with high accuracy. Finally, we combine information mined from text with curated annotations stored in ChEMBL to investigate the patterns of usage of different animal models across a range of experiments, drug classes, and disease areas.
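
As an illustration only, the idea of comparing entire assay descriptions through term embeddings can be sketched as follows. This is a minimal sketch and not the authors' pipeline: the three-dimensional vectors in `TOY_VECTORS` are invented stand-ins for embeddings that a trained Word2Vec model would supply, the whitespace tokenizer is a simplification, and averaging term vectors is one common way to embed a whole description, assumed here rather than taken from the abstract.

```python
import math

# Hypothetical toy vectors standing in for learned Word2Vec embeddings.
# Real embeddings would come from a model trained on assay descriptions.
TOY_VECTORS = {
    "rat": [0.9, 0.1, 0.0],
    "mouse": [0.8, 0.2, 0.1],
    "glucose": [0.1, 0.9, 0.2],
    "insulin": [0.2, 0.8, 0.1],
    "seizure": [0.0, 0.1, 0.9],
    "convulsion": [0.1, 0.0, 0.8],
}


def embed(description, vectors=TOY_VECTORS):
    """Average the vectors of all known tokens; return None if none are known."""
    tokens = description.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    metabolic_1 = embed("glucose lowering in rat")
    metabolic_2 = embed("insulin response in mouse")
    neurological = embed("seizure and convulsion in mouse")
    print(cosine(metabolic_1, metabolic_2))
    print(cosine(metabolic_1, neurological))
```

With these toy vectors, the two metabolic descriptions score as more similar to each other than either does to the seizure description. Description vectors of this kind could also serve as input features for a classifier such as a random forest, which is the general approach the abstract describes for predicting drug class.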

MeSH terms

Biological Assay / methods*
Data Mining / methods
Databases, Factual*
Drug Evaluation, Preclinical / methods*
High-Throughput Screening Assays / methods*
Machine Learning*
Natural Language Processing*
Reproducibility of Results
Sensitivity and Specificity

Grants and funding

BenevolentAI Ltd. is a private company, which funded the work described in the article. JPO was an employee of BenevolentAI Ltd., but now has a position at the Medicines Discovery Catapult, who had no role in funding this work. There was no grant funding used in the research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.