Transfer learning based clinical concept extraction on data from multiple sources

Xinbo Lv; Yi Guan; Benyang Deng

doi:10.1016/j.jbi.2014.05.006

J Biomed Inform. 2014 Dec:52:55-64. doi: 10.1016/j.jbi.2014.05.006. Epub 2014 May 21.

Authors

Xinbo Lv¹, Yi Guan², Benyang Deng¹

Affiliations

¹ School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China.
² School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China. Electronic address: guanyi@hit.edu.cn.

PMID: 24859154
DOI: 10.1016/j.jbi.2014.05.006

Abstract

Machine learning methods usually assume that training data and test data are drawn from the same distribution. However, this assumption often does not hold in clinical concept extraction. The main aim of this paper was to use training data from one institution to build a concept extraction model for data from another institution with a different distribution. An instance-based transfer learning method, TrAdaBoost, was applied in this work. To prevent negative transfer with TrAdaBoost, we integrated it with Bagging, which provides a "softer" weight update mechanism that requires only a tiny amount of training data from the target domain. Two data sets from the 2010 i2b2/VA challenge, BETH and PARTNERS, as well as BETHBIO, a data set we constructed ourselves, were used to demonstrate the transfer ability of our method. In the two experiments, BETH vs. PARTNERS and BETHBIO vs. PARTNERS, our method outperforms the baseline model by 2.3% and 4.4%, respectively, where the baseline model is trained on the combined training data from the source domain and the target domain. Additionally, confidence intervals for the performance metrics suggest that these improvements are statistically significant. Moreover, we explore the applicability of our method in further experiments. With our method, only a tiny amount of labeled data from the target domain is required to build a concept extraction model that produces better performance.

Keywords: Bagging; Clinical concept extraction; Machine learning; TrAdaBoost; Transfer learning.
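
For readers unfamiliar with TrAdaBoost, the instance re-weighting idea mentioned in the abstract can be illustrated with a minimal sketch of one round of the standard TrAdaBoost weight update as it is usually described in the literature. This is not the authors' implementation: the abstract does not specify how Bagging is integrated, so that part is not shown. The function name `tradaboost_update`, the binary 0/1 error indicators, and the clamping of the error rate are assumptions made for illustration only; concept extraction itself is a sequence labeling task and would need a richer setup.

```python
import math


def tradaboost_update(src_w, tgt_w, src_miss, tgt_miss, n_rounds):
    """One round of the standard TrAdaBoost weight update (illustrative sketch).

    src_w, tgt_w       -- current weights of source / target training instances
    src_miss, tgt_miss -- 1 if the current weak learner misclassified the
                          instance, 0 otherwise (binary labels assumed)
    n_rounds           -- total number of boosting rounds planned
    """
    # The weighted error is measured on target-domain instances only.
    eps = sum(w * m for w, m in zip(tgt_w, tgt_miss)) / sum(tgt_w)
    # Assumption: clamp the error so beta_t stays defined (needs 0 < eps < 0.5).
    eps = min(max(eps, 1e-10), 0.499)

    beta_t = eps / (1.0 - eps)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(len(src_w)) / n_rounds))

    # Misclassified source instances are down-weighted (beta <= 1),
    # misclassified target instances are up-weighted (beta_t < 1, negated exponent).
    new_src = [w * beta ** m for w, m in zip(src_w, src_miss)]
    new_tgt = [w * beta_t ** (-m) for w, m in zip(tgt_w, tgt_miss)]

    # Normalise so all weights sum to 1 (done here for readability).
    total = sum(new_src) + sum(new_tgt)
    return [w / total for w in new_src], [w / total for w in new_tgt]
```

In this sketch, source instances that disagree with the current learner gradually lose influence, while hard target instances gain it. An aggressive version of this update can discard useful source data, which is the kind of behavior the "softer" update described in the abstract is meant to temper.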

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Artificial Intelligence*
Data Mining / methods*
Electronic Health Records*
Humans
Medical Informatics
Natural Language Processing*
Vocabulary, Controlled

Grants and funding

U54LM008748/LM/NLM NIH HHS/United States