Building machine learning models without sharing patient data: A simulation-based analysis of distributed learning by ensembling

J Biomed Inform. 2020 Jun;106:103424. doi: 10.1016/j.jbi.2020.103424. Epub 2020 Apr 23.

Abstract

The development of machine learning solutions in medicine is often hindered by difficulties associated with sharing patient data. Distributed learning aims to train machine learning models locally, without requiring data sharing. However, the utility of distributed learning for rare diseases, where each contributing local center holds only a few training examples, has not been investigated. The aim of this work was to simulate distributed learning models by ensembling with artificial neural networks (ANN), support vector machines (SVM), and random forests (RF), and to evaluate them using four medical datasets. Distributed learning by ensembling locally trained agents improved performance compared to models trained on data from a single institution, even when only very few training examples were available per local center. Performance further improved as more locally trained models were added to the ensemble. Local class imbalance reduced distributed SVM performance but did not affect distributed RF and ANN classification. Our results suggest that distributed learning by ensembling can be used to train machine learning models without sharing patient data and is suitable for small datasets.
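
To make the approach concrete, the following is a minimal sketch (not the authors' code) of distributed learning by ensembling: it partitions a public dataset across simulated local centers, trains one classifier per center, and combines the locally trained models by averaging their predicted probabilities. The choice of scikit-learn, random forests, the number of centers, and soft voting as the combination rule are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: simulating distributed learning by ensembling.
# Assumptions (not from the paper): scikit-learn, a random forest per
# center, 10 simulated centers, and soft-voting (probability averaging).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Load a public medical dataset and hold out a shared test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Simulate local centers by partitioning the training data; each
# center sees only its own small, private shard.
n_centers = 10
shards = np.array_split(rng.permutation(len(X_train)), n_centers)

# Train one model per center on its local shard only (no data sharing).
local_models = []
for shard in shards:
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[shard], y_train[shard])
    local_models.append(model)

# Distributed prediction: average the class probabilities of all
# locally trained models; only fitted models move, never patient data.
ensemble_proba = np.mean(
    [m.predict_proba(X_test)[:, 1] for m in local_models], axis=0
)

# Baseline: a single-institution model trained on one shard alone.
single_proba = local_models[0].predict_proba(X_test)[:, 1]

print(f"Single-center AUC:        {roc_auc_score(y_test, single_proba):.3f}")
print(f"Distributed ensemble AUC: {roc_auc_score(y_test, ensemble_proba):.3f}")
```

In this setup only the fitted models leave each center, never the patient-level records, which mirrors the no-data-sharing constraint the paper's simulations evaluate; comparing the single-center baseline against the ensemble reproduces the kind of contrast reported in the abstract.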

Keywords: Artificial neural networks; Distributed learning; Machine learning; Medical information systems; Random forest; Support vector machines.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computer Simulation
  • Humans
  • Machine Learning*
  • Neural Networks, Computer*
  • Support Vector Machine