zbMATH — the first resource for mathematics

Dual extrapolation for sparse GLMs. (English) Zbl 07306914
Summary: Generalized Linear Models (GLMs) form a wide class of regression and classification models, in which the prediction is a function of a linear combination of the input variables. For statistical inference in high dimension, sparsity-inducing regularizations have proven useful while offering statistical guarantees. However, solving the resulting optimization problems can be challenging: even for popular iterative algorithms such as coordinate descent, one needs to loop over a large number of variables. To mitigate this, techniques known as screening rules and working sets reduce the size of the optimization problem at hand, either by progressively removing variables or by solving a growing sequence of smaller problems. For both techniques, significant variables are identified via convex duality arguments. In this paper, we show that the dual iterates of a GLM exhibit Vector AutoRegressive (VAR) behavior after sign identification, when the primal problem is solved with proximal gradient descent or cyclic coordinate descent. Exploiting this regularity, one can construct dual points that offer tighter certificates of optimality, enhancing the performance of screening rules and working set algorithms.
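To make the mechanism in the summary concrete, the sketch below is a toy illustration (not the authors' implementation, cf. the Celer solver in [40]) on a synthetic Lasso problem: proximal gradient descent (ISTA) produces primal iterates, a feasible dual point is obtained by rescaling the residual, and an extrapolated dual point is built from an Anderson-style affine combination of the last few residuals. The duality gap each dual point certifies upper-bounds the suboptimality of the primal iterate; tighter certificates let screening rules discard more variables. All problem sizes, names, and the choice K = 5 are illustrative.

```python
import numpy as np

def soft_thresh(x, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def dual_point(X, alpha, r):
    """Rescale a residual r into a feasible point of the Lasso dual
    (feasibility: ||X^T theta||_inf <= alpha)."""
    return r / max(1.0, np.linalg.norm(X.T @ r, np.inf) / alpha)

def duality_gap(X, y, w, theta, alpha):
    """P(w) - D(theta): a certificate upper-bounding the suboptimality of w."""
    primal = 0.5 * np.linalg.norm(y - X @ w) ** 2 + alpha * np.abs(w).sum()
    dual = 0.5 * np.linalg.norm(y) ** 2 - 0.5 * np.linalg.norm(y - theta) ** 2
    return primal - dual

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
y = rng.standard_normal(50)
alpha = 0.5 * np.linalg.norm(X.T @ y, np.inf)  # half of alpha_max
L = np.linalg.norm(X, 2) ** 2                  # Lipschitz constant of the smooth part

w = np.zeros(X.shape[1])
residuals = []
for _ in range(200):                           # proximal gradient (ISTA) iterations
    w = soft_thresh(w + X.T @ (y - X @ w) / L, alpha / L)
    residuals.append(y - X @ w)

# Classical dual point: rescaled last residual.
theta_res = dual_point(X, alpha, residuals[-1])

# Extrapolated dual point: affine combination of the last K residuals with
# Anderson-style coefficients (exploiting the VAR structure), then rescaled.
K = 5
R = np.asarray(residuals[-K:])                 # shape (K, n_samples)
U = np.diff(R, axis=0)                         # successive differences
z = np.linalg.solve(U @ U.T + 1e-10 * np.eye(K - 1), np.ones(K - 1))
c = z / z.sum()                                # coefficients summing to 1
theta_acc = dual_point(X, alpha, c @ R[1:])

gap_res = duality_gap(X, y, w, theta_res, alpha)
gap_acc = duality_gap(X, y, w, theta_acc, alpha)
```

Both dual points are feasible by construction, so weak duality guarantees nonnegative gaps; in practice the extrapolated point typically certifies a tighter (smaller) gap once the sign pattern of the iterates has stabilized.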
68T05 Learning and adaptive systems in artificial intelligence
Full Text: Link
[1] K. Arrow, L. Hurwicz, and H. Uzawa. Studies in Nonlinear Programming. Stanford University Press, 1958.
[2] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.
[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York, 2011.
[4] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31-39, 2011.
[5] A. Belloni, V. Chernozhukov, and L. Wang. Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791-806, 2011.
[6] A. Boisbunon, R. Flamary, and A. Rakotomamonjy. Active set strategy for high-dimensional non-convex sparse optimization problems. In ICASSP, pages 1517-1521, 2014.
[7] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. A dynamic screening principle for the lasso. In EUSIPCO, 2014.
[8] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[9] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. Vanderplas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. arXiv e-prints, 2013.
[10] E. Candès and B. Recht. Simple bounds for recovering low-complexity models. Mathematical Programming, 141(1-2):577-589, 2013.
[11] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120-145, 2011.
[12] S. S. Chen and D. L. Donoho. Atomic decomposition by basis pursuit. In SPIE, 1995.
[13] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res., 17(83):1-5, 2016.
[14] C. Dünner, S. Forte, M. Takáč, and M. Jaggi. Primal-dual rates and certificates. In ICML, pages 783-792, 2016.
[15] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667-698, 2012.
[16] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.
[17] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(5):849-911, 2008.
[18] O. Fercoq and P. Bianchi. A coordinate descent primal-dual algorithm with large step size and possibly non separable functions. arXiv preprint arXiv:1508.04625, 2015.
[19] O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM J. Optim., 25(3):1997-2013, 2015.
[20] O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: safer rules for the lasso. In ICML, pages 333-342, 2015.
[21] J. Friedman, T. J. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 1(2):302-332, 2007.
[22] J. Friedman, T. J. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33(1):1, 2010.
[23] J.-J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50(6):1341-1344, 2004.
[24] A. Gramfort, M. Kowalski, and M. Hämäläinen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7):1937-1961, 2012.
[25] A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446-460, 2014.
[26] E. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM J. Optim., 19(3):1107-1130, 2008.
[27] W. L. Hare and A. S. Lewis. Identifying active manifolds. Algorithmic Operations Research, 2(2):75-75, 2007.
[28] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. II, volume 306. Springer-Verlag, Berlin, 1993.
[29] C.-J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. QUIC: Quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res., 15:2911-2947, 2014.
[30] T. B. Johnson and C. Guestrin. Blitz: A principled meta-algorithm for scaling sparse optimization. In ICML, pages 1171-1179, 2015.
[31] T. B. Johnson and C. Guestrin. StingyCD: Safely avoiding wasteful updates in coordinate descent. In ICML, pages 1752-1760, 2017.
[32] T. B. Johnson and C. Guestrin. A fast, principled working set algorithm for exploiting piecewise linear structure in convex problems. arXiv preprint arXiv:1807.08046, 2018.
[33] P. Karimireddy, A. Koloskova, S. Stich, and M. Jaggi. Efficient Greedy Coordinate Descent for Composite Problems. arXiv preprint arXiv:1810.06999, 2018.
[34] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. J. Mach. Learn. Res., 8(8):1519-1555, 2007.
[35] M. Kowalski, P. Weiss, A. Gramfort, and S. Anthoine. Accelerating ISTA with an active set strategy. In OPT 2011: 4th International Workshop on Optimization for Machine Learning, page 7, 2011.
[36] S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVM-based Python JIT Compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1-6. ACM, 2015.
[37] J. Lee, Y. Sun, and M. Saunders. Proximal Newton-type methods for convex optimization. In NIPS, pages 827-835, 2012.
[38] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, École normale supérieure de Cachan, 2010.
[39] M. Massias, A. Gramfort, and J. Salmon. From safe screening rules to working sets for faster lasso-type solvers. In 10th NIPS Workshop on Optimization for Machine Learning, 2017.
[40] M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the Lasso with dual extrapolation. In ICML, 2018.
[41] P. McCullagh and J. A. Nelder. Generalized Linear Models, Second Edition. Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series. 1989.
[42] D. Myers and W. Shih. A constraint selection technique for a class of linear programs. Operations Research Letters, 7(4):191-195, 1988.
[43] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparse multi-task and multi-class models. In NIPS, pages 811-819, 2015.
[44] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. GAP safe screening rules for sparse-group lasso. In NIPS, 2016.
[45] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res., 18(128):1-33, 2017.
[46] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231-252, 2010.
[47] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In ICML, pages 1382-1390, 2013.
[48] F. Palacios-Gomez, L. Lasdon, and M. Engquist. Nonlinear optimization by successive linear programming. Management Science, 28(10):1106-1120, 1982.
[49] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825-2830, 2011.
[50] D. Perekrestenko, V. Cevher, and M. Jaggi. Faster coordinate descent via adaptive importance sampling. In AISTATS, pages 869-877, 2017.
[51] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.
[52] V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In ICML, pages 848-855, 2008.
[53] M. De Santis, S. Lucidi, and F. Rinaldi. A fast active set block coordinate descent algorithm for ℓ1-regularized least squares. SIAM J. Optim., 26(1):781-809, 2016.
[54] K. Scheinberg and X. Tang. Complexity of inexact proximal Newton methods. arXiv preprint arXiv:1311.6547, 2013.
[55] D. Scieur. Acceleration in Optimization. PhD thesis, École normale supérieure, 2018.
[56] D. Scieur, A. d'Aspremont, and F. Bach. Regularized nonlinear acceleration. In NIPS, pages 712-720, 2016.
[57] N. Simon, J. Friedman, T. J. Hastie, and R. Tibshirani. A sparse-group lasso. J. Comput. Graph. Statist., 22(2):231-245, 2013.
[58] G. Thompson, F. Tonge, and S. Zionts. Techniques for removing nonbinding constraints and extraneous variables from linear programming problems. Management Science, 12(7):588-608, 1966.
[59] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267-288, 1996.
[60] R. Tibshirani, J. Bien, J. Friedman, T. J. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B Stat. Methodol., 74(2):245-266, 2012.
[61] R. J. Tibshirani. The lasso problem and uniqueness. Electron. J. Stat., 7:1456-1490, 2013.
[62] R. J. Tibshirani. Dykstra's Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions. In NIPS, pages 517-528, 2017.
[63] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475-494, 2001.
[64] S. Vaiter, M. Golbabaee, J. Fadili, and G. Peyré. Model selection with low complexity priors. Information and Inference: A Journal of the IMA, 4(3):230-287, 2015.
[65] S. Vaiter, G. Peyré, and J. M. Fadili. Model consistency of partly smooth regularizers. IEEE Trans. Inf. Theory, 64(3):1725-1737, 2018.
[66] J. Wang, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. arXiv preprint arXiv:1211.3966, 2012.
[67] Z. J. Xiang, Y. Wang, and P. J. Ramadge. Screening tests for lasso problems. IEEE Trans. Pattern Anal. Mach. Intell., PP(99), 2016.
[68] G. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for ℓ1-regularized logistic regression. J. Mach. Learn. Res., 13:1999-2030, 2012.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.