A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors

Jilin Zhang; Hangdi Tu; Yongjian Ren; Jian Wan; Li Zhou; Mingwei Li; Jue Wang; Lifeng Yu; Chang Zhao; Lei Zhang

doi:10.3390/s17102172

Sensors (Basel). 2017 Sep 21;17(10):2172. doi: 10.3390/s17102172.

Authors

Jilin Zhang¹ ² ³ ⁴ ⁵, Hangdi Tu⁶ ⁷, Yongjian Ren⁸ ⁹, Jian Wan¹⁰ ¹¹ ¹² ¹³, Li Zhou¹⁴ ¹⁵, Mingwei Li¹⁶ ¹⁷, Jue Wang¹⁸, Lifeng Yu¹⁹ ²⁰, Chang Zhao²¹ ²², Lei Zhang²³

Affiliations

¹ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China. jilin.zhang@hdu.edu.cn.
² Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou 310018, China. jilin.zhang@hdu.edu.cn.
³ College of Electrical Engineering, Zhejiang University, Hangzhou 310058, China. jilin.zhang@hdu.edu.cn.
⁴ School of Information and Electronic Engineering, Zhejiang University of Science & Technology, Hangzhou 310023, China. jilin.zhang@hdu.edu.cn.
⁵ Zhejiang Provincial Engineering Center on Media Data Cloud Processing and Analysis, Hangzhou 310018, China. jilin.zhang@hdu.edu.cn.
⁶ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China. 152050103@hdu.edu.cn.
⁷ Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou 310018, China. 152050103@hdu.edu.cn.
⁸ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China. yongjian.ren@hdu.edu.cn.
⁹ Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou 310018, China. yongjian.ren@hdu.edu.cn.
¹⁰ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China. wanjian@hdu.edu.cn.
¹¹ Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou 310018, China. wanjian@hdu.edu.cn.
¹² School of Information and Electronic Engineering, Zhejiang University of Science & Technology, Hangzhou 310023, China. wanjian@hdu.edu.cn.
¹³ Zhejiang Provincial Engineering Center on Media Data Cloud Processing and Analysis, Hangzhou 310018, China. wanjian@hdu.edu.cn.
¹⁴ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China. juliy26@hdu.edu.cn.
¹⁵ Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou 310018, China. juliy26@hdu.edu.cn.
¹⁶ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China. 161050009@hdu.edu.cn.
¹⁷ Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou 310018, China. 161050009@hdu.edu.cn.
¹⁸ Supercomputing Center of Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China. 151050064@hdu.edu.cn.
¹⁹ Hithink RoyalFlush Information Network Co., Ltd., Hangzhou 310023, Zhejiang, China. wangjue@sccas.cn.
²⁰ Financial Information Engineering Technology Research Center of Zhejiang Province, Hangzhou 310023, China. wangjue@sccas.cn.
²¹ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China. yulifeng@myhexin.com.
²² Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou 310018, China. yulifeng@myhexin.com.
²³ Computer Science Department, Beijing University of Civil Engineering and Architecture, Beijing 100044, China. lei.zhang@bucea.edu.cn.

Abstract

Distributed machine learning has become the mainstream approach for exploiting the distributed nature of sensors, but differences in the computing capability of sensors, together with network delays, strongly affect the accuracy and convergence rate of the machine learning model. This paper describes a parameter communication optimization strategy that balances training overhead against communication overhead. We extend the fault tolerance of iterative-convergent machine learning algorithms and propose Dynamic Finite Fault Tolerance (DFFT). Based on DFFT, we implement a parameter communication optimization strategy for distributed machine learning, named the Dynamic Synchronous Parallel Strategy (DSP), which uses a performance monitoring model to dynamically adjust the parameter synchronization strategy between worker nodes and the Parameter Server (PS). This strategy makes full use of the computing power of each sensor, ensures the accuracy of the machine learning model, and prevents model training from being disturbed by tasks unrelated to the sensors.
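The abstract does not give the DFFT rule itself, so the following is only a rough illustration of the general idea of adjusting worker-to-PS synchronization from monitored performance. It is a generic bounded-staleness gate whose bound widens as the measured speed gap between workers grows. The class name `DynamicStalenessGate`, its methods, the bound limits, and the ratio-based adjustment rule are all assumptions made for this sketch and are not taken from the paper.

```python
class DynamicStalenessGate:
    """Hypothetical sketch: a bounded-staleness gate with an adaptive bound.

    Not the authors' DFFT/DSP implementation; every name and rule here
    is an assumption chosen for illustration.
    """

    def __init__(self, num_workers, min_bound=1, max_bound=8):
        self.min_bound = min_bound
        self.max_bound = max_bound
        self.clocks = [0] * num_workers        # iterations completed per worker
        self.iter_times = [1.0] * num_workers  # last measured seconds per iteration

    def report(self, worker, seconds):
        # Called when a worker finishes an iteration and pushes its update.
        self.clocks[worker] += 1
        self.iter_times[worker] = seconds

    def staleness_bound(self):
        # Assumed rule: allow more slack when workers differ more in speed,
        # clamped so the tolerated staleness always stays finite.
        fastest = max(min(self.iter_times), 1e-9)
        ratio = max(self.iter_times) / fastest
        return max(self.min_bound, min(self.max_bound, round(ratio)))

    def may_proceed(self, worker):
        # A worker may start its next iteration only if it is not too far
        # ahead of the slowest worker.
        lead = self.clocks[worker] - min(self.clocks)
        return lead <= self.staleness_bound()


gate = DynamicStalenessGate(num_workers=2)
gate.report(0, 1.0)
first = gate.may_proceed(0)    # lead of 1, bound of 1
gate.report(0, 1.0)
second = gate.may_proceed(0)   # lead of 2, bound still 1
gate.report(1, 3.0)            # slow worker reports; speed ratio becomes 3
third = gate.may_proceed(0)    # lead of 1, bound widened to 3
```

In this sketch a fast worker is held back once it leads the slowest worker by more than the current bound, and the bound itself is recomputed from the latest timing reports, which is one simple way a monitoring signal could drive the synchronization policy.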

Keywords: distributed machine learning; dynamic synchronous parallel strategy (DSP); parameter server (PS); sensors.