A machine-compiled database of genome-wide association studies

Volodymyr Kuleshov; Jialin Ding; Christopher Vo; Braden Hancock; Alexander Ratner; Yang Li; Christopher Ré; Serafim Batzoglou; Michael Snyder

doi:10.1038/s41467-019-11026-x

Nat Commun. 2019 Jul 26;10(1):3341. doi: 10.1038/s41467-019-11026-x.

Authors

Volodymyr Kuleshov¹ ², Jialin Ding³, Christopher Vo³, Braden Hancock³, Alexander Ratner³, Yang Li⁴, Christopher Ré³, Serafim Batzoglou³, Michael Snyder⁵

Affiliations

¹ Department of Computer Science, Stanford University, Stanford, CA, 94305, USA. kuleshov@cs.stanford.edu.
² Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA. kuleshov@cs.stanford.edu.
³ Department of Computer Science, Stanford University, Stanford, CA, 94305, USA.
⁴ Department of Medicine, University of Chicago, Chicago, IL, 60637, USA.
⁵ Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA.

Abstract

Tens of thousands of genotype-phenotype associations have been discovered to date, yet not all of them are easily accessible to scientists. Here, we describe GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms. Our information extraction system helps curators by automatically collecting over 6,000 associations from open-access publications with an estimated recall of 60-80% and with an estimated precision of 78-94% (measured relative to existing manually curated knowledge bases). This system represents a fully automated GWAS curation effort and is made possible by a paradigm for constructing machine learning systems called data programming. Our work represents a step towards making the curation of scientific literature more efficient using automated systems.
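
The abstract reports precision and recall measured relative to existing manually curated knowledge bases. As a rough illustration of what such a comparison involves, the sketch below treats each knowledge base as a set of (variant, phenotype) pairs and computes both quantities by set intersection. This is an assumption-laden toy and not the authors' evaluation procedure: the function name `precision_recall`, the pair representation, and the placeholder identifiers are all hypothetical, and exact matching of pairs is a simplification.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) of `extracted` measured against `reference`.

    Both arguments are iterables of hashable items, here assumed to be
    (variant_id, phenotype) pairs. Exact matching is a simplifying assumption.
    """
    extracted, reference = set(extracted), set(reference)
    true_positives = extracted & reference
    # Precision: fraction of extracted pairs that also appear in the reference.
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    # Recall: fraction of reference pairs that the extractor recovered.
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical placeholder identifiers, not real associations.
reference = {
    ("rsA", "trait 1"),
    ("rsB", "trait 1"),
    ("rsC", "trait 2"),
    ("rsD", "trait 3"),
    ("rsE", "trait 3"),
}
extracted = {
    ("rsA", "trait 1"),
    ("rsB", "trait 1"),
    ("rsC", "trait 2"),
    ("rsX", "trait 4"),  # absent from the reference set
}

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.75 recall=0.60
```

In this simplified setup, an extracted pair that is missing from the reference set lowers precision even if it is a genuine association that the curators did not record, which is one reason figures of this kind are best read as estimates.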

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Computational Biology
Data Mining
Databases, Genetic*
Genome, Human
Genome-Wide Association Study*
Humans
Machine Learning