Convergence of unregularized online learning algorithms.

*(English)*Zbl 06982927Summary: In this paper we study the convergence of online gradient descent algorithms in reproducing kernel Hilbert spaces (RKHSs) without regularization. We establish a sufficient condition and a necessary condition for the convergence of excess generalization errors in expectation. A sufficient condition for the almost sure convergence is also given. With high probability, we provide explicit convergence rates of the excess generalization errors for both averaged iterates and the last iterate, which in turn also imply convergence rates with probability one. To our best knowledge, this is the first high-probability convergence rate for the last iterate of online gradient descent algorithms in the general convex setting. Without any boundedness assumptions on iterates, our results are derived by a novel use of two measures of the algorithm’s one-step progress, respectively by generalization errors and by distances in RKHSs, where the variances of the involved martingales are cancelled out by the
descent
property of the algorithm.

##### MSC:

68T05 | Learning and adaptive systems in artificial intelligence |

##### Software:

Pegasos
PDF
BibTeX
XML
Cite

\textit{Y. Lei} et al., J. Mach. Learn. Res. 18, Paper No. 171, 33 p. (2018; Zbl 06982927)

Full Text:
Link

##### References:

[1] | Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009. · Zbl 1365.94132 |

[2] | L´eon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nımes, 91(8), 1991. |

[3] | L´eon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998. |

[4] | Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004. · Zbl 1295.68182 |

[5] | Di-Rong Chen, Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5:1143–1175, 2004. · Zbl 1222.68167 |

[6] | Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3): 273–297, 1995. · Zbl 0831.68098 |

[7] | Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007. · Zbl 1274.41001 |

[8] | Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016. · Zbl 1346.60041 |

[9] | Joseph L Doob. Measure Theory, Graduate Texts in Mathematics. Springer, 1994. |

[10] | John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009. · Zbl 1235.62151 |

[11] | John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In Conference on Learning Theory, pages 14–26, 2010. |

[12] | Zheng-Chu Guo and Lei Shi. Fast and strong convergence of online learning algorithms. submitted, 2017. |

[13] | Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016. |

[14] | Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. · Zbl 0127.10602 |

[15] | Jyrki Kivinen, Alexander J Smola, and Robert C Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004. 31 · Zbl 1369.68281 |

[16] | Yunwen Lei and Ding-Xuan Zhou. Convergence of online mirror descent algorithms. submitted, 2017. |

[17] | Junhong Lin and Ding-Xuan Zhou. Learning theory of randomized Kaczmarz algorithm. Journal of Machine Learning Research, 16:3341–3365, 2015. · Zbl 1351.62133 |

[18] | Junhong Lin and Ding-Xuan Zhou. Online learning algorithms can converge comparably fast as batch learning. IEEE Transactions on Neural Networks and Learning Systems, in press, 2018. doi: 10.1109/TNNLS.2017.2677970. |

[19] | Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes sgm. In International Conference on Machine Learning, pages 2340–2348, 2016. |

[20] | Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011. |

[21] | Klaus-Robert M¨uller, Sebastian Mika, Gunnar Ratsch, Koji Tsuda, and Bernhard Sch¨olkopf.An introduction to kernel-based learning algorithms.IEEE Transactions on Neural Networks, 12(2):181–201, 2001. |

[22] | Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro.Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. · Zbl 1189.90109 |

[23] | Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Quoc V Le, and Andrew Y Ng. On optimization methods for deep learning. In International Conference on Machine Learning, pages 265–272, 2011. |

[24] | Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, pages 449–456, 2012. |

[25] | Bernhard Sch¨olkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2001. |

[26] | Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical Programming, 127(1):3–30, 2011. · Zbl 1211.90239 |

[27] | Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013. |

[28] | Steve Smale and Yuan Yao. Online learning algorithms. Foundations of Computational Mathematics, 6(2):145–170, 2006. · Zbl 1119.68098 |

[29] | Steve Smale and Ding-Xuan Zhou. Online learning with markov sampling. Analysis and Applications, 7(01):87–113, 2009. 32 · Zbl 1170.68022 |

[30] | Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(Nov):67–93, 2001. · Zbl 1009.68143 |

[31] | Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008. · Zbl 1203.68171 |

[32] | Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013. |

[33] | Pierre Tarres and Yuan Yao. Online learning as stochastic approximation of regularization paths: optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716–5735, 2014. · Zbl 1360.62192 |

[34] | Yuan Yao. On complexity issues of online learning algorithms. IEEE Transactions on Information Theory, 56(12):6470–6481, 2010. · Zbl 1368.68288 |

[35] | Gui-Bo Ye and Ding-Xuan Zhou. Fully online classification by regularization. Applied and Computational Harmonic Analysis, 23(2):198–214, 2007. · Zbl 1124.68099 |

[36] | Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008. · Zbl 1175.68211 |

[37] | Yiming Ying and Ding-Xuan Zhou. Online regularized classification algorithms. IEEE Transactions on Information Theory, 52(11):4775–4788, 2006. · Zbl 1323.68450 |

[38] | Yiming Ying and Ding-Xuan Zhou. Unregularized online learning algorithms with general loss functions. Applied and Computational Harmonic Analysis, 2(42):224–244, 2017. · Zbl 1382.68204 |

[39] | Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning, pages 919–926, 2004. |

[40] | Tong Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Conference on Learning Theory, pages 173–187, 2005. · Zbl 1137.68568 |

[41] | Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936, 2003. |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.