Primal Averaging: A New Gradient Evaluation Step to Attain the Optimal Individual Convergence

IEEE Trans Cybern. 2020 Feb;50(2):835-845. doi: 10.1109/TCYB.2018.2874332. Epub 2018 Oct 19.

Abstract

Many well-known first-order gradient methods have been extended to cope with large-scale composite problems, which often arise as regularized empirical risk minimization in machine learning. However, their optimal convergence is attained only in terms of the weighted average of past iterative solutions. How to make the individual convergence of stochastic gradient descent (SGD) optimal, especially for strongly convex problems, has become a challenging problem in the machine learning community. On the other hand, Nesterov's recent weighted averaging strategy succeeds in achieving the optimal individual convergence of dual averaging (DA), but it fails for basic mirror descent (MD). In this paper, a new primal averaging (PA) gradient operation step is presented, in which the gradient is evaluated at the weighted average of all past iterative solutions. We prove that simply modifying the gradient operation step in MD with the PA strategy suffices to recover the optimal individual rate for general convex problems. Along this line, the optimal individual rate of convergence for strongly convex problems can also be achieved by imposing the strong convexity on the gradient operation step. Furthermore, we extend PA-MD to solve regularized nonsmooth learning problems in the stochastic setting, which reveals that the PA strategy is a simple yet effective extra step toward the optimal individual convergence of SGD. Several experiments on real-world sparse learning and SVM problems verify the correctness of our theoretical analysis.
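
The PA step described above, in which the (sub)gradient is evaluated at the weighted average of past iterates rather than at the current iterate, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the algorithm as specified in the paper: the abstract does not give the averaging weights, step sizes, or mirror map, so the sketch assumes uniform weights, a caller-supplied step-size schedule, and the Euclidean mirror map (so the MD update reduces to a projected subgradient step onto an L2 ball). The names `pa_md_sketch` and `project_l2_ball` are hypothetical.

```python
import numpy as np


def project_l2_ball(x, radius):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def pa_md_sketch(subgrad, x0, steps, step_size, radius=1.0):
    """Illustrative primal-averaging variant of (Euclidean) mirror descent.

    Assumptions not taken from the source: uniform averaging weights,
    Euclidean mirror map, L2-ball constraint set.
    """
    x = project_l2_ball(np.asarray(x0, dtype=float), radius)
    x_avg = x.copy()  # running average of all iterates so far
    for t in range(1, steps + 1):
        # Key difference from plain MD: the subgradient is taken at the
        # average of past iterates, not at the current iterate x.
        g = subgrad(x_avg)
        x = project_l2_ball(x - step_size(t) * g, radius)
        # Incremental update of the uniform average over t + 1 iterates.
        x_avg += (x - x_avg) / (t + 1)
    return x, x_avg


if __name__ == "__main__":
    # Toy nonsmooth objective f(x) = ||x - c||_1 with subgradient sign(x - c).
    c = np.array([0.5, -0.5])
    x_last, x_mean = pa_md_sketch(
        subgrad=lambda z: np.sign(z - c),
        x0=np.zeros(2),
        steps=200,
        step_size=lambda t: 1.0 / np.sqrt(t),  # assumed schedule
    )
    print("last iterate:", x_last, "average:", x_mean)
```

The function returns both the last iterate and the running average so the two can be compared directly, since individual convergence concerns the last iterate itself.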