Progressive sampling-based Bayesian optimization for efficient and automatic machine learning model selection

Xueqiang Zeng; Gang Luo

doi:10.1007/s13755-017-0023-z

Health Inf Sci Syst. 2017 Sep 27;5(1):2. doi: 10.1007/s13755-017-0023-z. eCollection 2017 Dec.

Authors

Xueqiang Zeng¹, Gang Luo²

Affiliations

¹ Computer Center, Nanchang University, 999 Xuefu Road, Nanchang, 330031 Jiangxi People's Republic of China.
² Department of Biomedical Informatics and Medical Education, University of Washington, UW Medicine South Lake Union, 850 Republican Street, Building C, Box 358047, Seattle, WA 98109 USA.

Abstract

Purpose: Machine learning is broadly used for clinical data analysis. Before training a model, a machine learning algorithm must be selected. Also, the values of one or more model parameters termed hyper-parameters must be set. Selecting algorithms and hyper-parameter values requires advanced machine learning knowledge and many labor-intensive manual iterations. To lower the barrier to using machine learning, various automatic selection methods for algorithms and/or hyper-parameter values have been proposed. However, existing automatic selection methods are inefficient on large data sets. This poses a challenge for using machine learning in the clinical big data era.

Methods: To address the challenge, this paper presents progressive sampling-based Bayesian optimization, an efficient and automatic selection method for both algorithms and hyper-parameter values.

Results: We report an implementation of the method. We show that, compared to a state-of-the-art automatic selection method, our method can significantly reduce search time, classification error rate, and standard deviation of error rate due to randomization.

Conclusions: This is a major step towards enabling the fast turnaround in identifying high-quality solutions that many machine learning-based clinical data analysis tasks require.

Keywords: Automatic machine learning model selection; Bayesian optimization; Clinical big data; Progressive sampling.