Pre-training phenotyping classifiers

Dmitriy Dligach; Majid Afshar; Timothy Miller

doi:10.1016/j.jbi.2020.103626

J Biomed Inform. 2021 Jan:113:103626. doi: 10.1016/j.jbi.2020.103626. Epub 2020 Nov 28.

Authors

Dmitriy Dligach¹, Majid Afshar², Timothy Miller³

Affiliations

¹ Loyola University Chicago, Department of Computer Science, Chicago, IL, United States. Electronic address: dd@cs.luc.edu.
² Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison, Madison, WI, United States. Electronic address: mafshar@medicine.wisc.edu.
³ Computational Health Informatics Program (CHIP), Boston Children's Hospital and Harvard Medical School, Boston, MA, United States. Electronic address: timothy.miller@childrens.harvard.edu.

Abstract

Recent transformer-based pre-trained language models have become a de facto standard for many text classification tasks. Nevertheless, their utility in the clinical domain, where classification is often performed at the encounter or patient level, is still uncertain because of the limit on maximum input length. In this work, we introduce a self-supervised pre-training method that relies on a masked token objective and is free from the limit on maximum input length. We compare the proposed method with supervised pre-training that uses billing codes as a source of supervision. We evaluate the proposed method on one publicly available and three in-house datasets using standard evaluation metrics such as the area under the ROC curve and F1 score. We find that, surprisingly, even though self-supervised pre-training performs slightly worse than supervised pre-training, it still preserves most of the gains from pre-training.

Keywords: Automatic phenotyping; Natural language processing; Pre-training.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Humans
Language*
Natural Language Processing*
ROC Curve
