Machine learning for biomedical literature triage

Hayda Almeida; Marie-Jean Meurs; Leila Kosseim; Greg Butler; Adrian Tsang

doi:10.1371/journal.pone.0115892

PLoS One. 2014 Dec 31;9(12):e115892. doi: 10.1371/journal.pone.0115892. eCollection 2014.

Authors

Hayda Almeida¹, Marie-Jean Meurs², Leila Kosseim¹, Greg Butler³, Adrian Tsang²

Affiliations

¹ Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada.
² Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada.
³ Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada; Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada.

Abstract

This paper presents a machine learning system that supports triage, the first task in the manual curation of biological literature. We compare the performance of various classification models by experimenting with dataset sampling factors, a set of features, and three machine learning algorithms (Naive Bayes, Support Vector Machine, and Logistic Model Trees). The results show that the model best suited to the imbalanced datasets of the triage classification task combines domain-relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.
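
To illustrate the under-sampling idea mentioned in the abstract, here is a minimal sketch of random under-sampling of the majority class. It is not the authors' implementation: the function name `undersample`, the `ratio` parameter (majority examples kept per minority example), the label strings, and the toy data are all assumptions made for illustration only.

```python
import random
from collections import Counter


def undersample(examples, labels, majority_label, ratio=1.0, seed=0):
    """Randomly discard majority-class examples (hypothetical helper).

    Keeps every minority example and at most ratio * n_minority
    majority examples, preserving the original ordering.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Never ask for more majority examples than actually exist.
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = sorted(minority + rng.sample(majority, n_keep))
    return [examples[i] for i in kept], [labels[i] for i in kept]


# Toy imbalanced corpus (assumed data): 10 relevant vs 90 irrelevant documents.
docs = [f"doc{i}" for i in range(100)]
labels = ["relevant" if i % 10 == 0 else "irrelevant" for i in range(100)]

balanced_docs, balanced_labels = undersample(docs, labels, "irrelevant", ratio=1.0)
counts = Counter(balanced_labels)
print(counts["relevant"], counts["irrelevant"])
```

With `ratio=1.0` the two classes end up the same size; a larger ratio keeps more of the majority class, which is one way to experiment with different sampling factors before training a classifier.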

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Bayes Theorem
Databases, Bibliographic*
Decision Trees
Medical Informatics / methods*
Models, Theoretical
Support Vector Machine*

Grants and funding

This work was supported by funding from Genome Canada (http://www.genomecanada.ca), Genome Quebec (http://www.genomequebec.com/), and Genome Alberta (http://genomealberta.ca/), to AT. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.