Dynamics of Learning in MLP: Natural Gradient and Singularity Revisited

Shun-Ichi Amari; Tomoko Ozeki; Ryo Karakida; Yuki Yoshida; Masato Okada

doi:10.1162/neco_a_01029

Neural Comput. 2018 Jan;30(1):1-33. doi: 10.1162/neco_a_01029. Epub 2017 Oct 24.

Authors

Shun-Ichi Amari¹, Tomoko Ozeki², Ryo Karakida³, Yuki Yoshida⁴, Masato Okada⁵

Affiliations

¹ RIKEN Brain Science Institute, Wako-shi, Saitama 351-0198, Japan amari@brain.riken.jp.
² Tokai University, Hiratsuka-shi, Kanagawa 259-1292, Japan tozeki@tokai.ac.jp.
³ National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan karakida.ryo@aist.go.jp.
⁴ University of Tokyo, Kashiwa-shi, Chiba 277-8561, Japan yoshida@mns.k.u-tokyo.ac.jp.
⁵ University of Tokyo, Kashiwa-shi, Chiba 277-8561, Japan okada@k.u-tokyo.ac.jp.

PMID: 29064781

Abstract

The dynamics of supervised learning play a main role in deep learning, which takes place in the parameter space of a multilayer perceptron (MLP). We review the history of supervised stochastic gradient learning, focusing on its singular structure and natural gradient. The parameter space includes singular regions in which parameters are not identifiable. One of our results is a full exploration of the dynamical behaviors of stochastic gradient learning in an elementary singular network. The bad news is its pathological nature, in which part of the singular region becomes an attractor and another part a repulser at the same time, forming a Milnor attractor. A learning trajectory is attracted by the attractor region, staying in it for a long time, before it escapes the singular region through the repulser region. This is typical of plateau phenomena in learning. We demonstrate the strange topology of a singular region by introducing blow-down coordinates, which are useful for analyzing the natural gradient dynamics. We confirm that the natural gradient dynamics are free of critical slowdown. The second main result is the good news: the interactions of elementary singular networks eliminate the attractor part and the Milnor-type attractors disappear. This explains why large-scale networks do not suffer from serious critical slowdowns due to singularities. We finally show that the unit-wise natural gradient is effective for learning in spite of its low computational cost.