
zbMATH — the first resource for mathematics

Learning with Fenchel-Young losses. (English) Zbl 07255066
Summary: Over the past decades, numerous loss functions have been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and more generally structured prediction. Understanding the core principles and theoretical properties underpinning these losses is key to choosing the right loss for the right problem, as well as to creating new losses that combine their strengths. In this paper, we introduce Fenchel-Young losses, a generic way to construct a convex loss function for a regularized prediction function. We provide an in-depth study of their properties in a very broad setting, covering all the aforementioned supervised learning tasks and revealing new connections between sparsity, generalized entropies, and separation margins. We show that Fenchel-Young losses unify many well-known loss functions and make it easy to create useful new ones. Finally, we derive efficient predictive and training algorithms, making Fenchel-Young losses appealing in both theory and practice.
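To illustrate the construction the summary describes, here is a minimal sketch (not from the paper itself) of a Fenchel-Young loss in Python. A Fenchel-Young loss generated by a regularizer Ω has the form L(θ; y) = Ω*(θ) + Ω(y) − ⟨θ, y⟩, where Ω* is the convex conjugate. Choosing Ω as the negative Shannon entropy over the probability simplex gives Ω*(θ) = logsumexp(θ) and Ω(e_y) = 0, recovering the familiar multinomial logistic (softmax cross-entropy) loss; the function name below is our own.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def fy_loss_shannon(theta, y):
    """Fenchel-Young loss with the negative Shannon entropy regularizer.

    L(theta; e_y) = Omega*(theta) + Omega(e_y) - <theta, e_y>
    where Omega*(theta) = logsumexp(theta) and Omega(e_y) = 0,
    i.e. the multinomial logistic loss.
    """
    return logsumexp(theta) - theta[y]

def fy_grad_shannon(theta, y):
    """Gradient of the loss in theta: the regularized prediction
    softmax(theta) minus the one-hot target e_y (the 'residual' form
    shared by all Fenchel-Young losses)."""
    e_y = np.zeros_like(theta)
    e_y[y] = 1.0
    return softmax(theta) - e_y
```

For uniform scores the loss equals log(number of classes), and by the Fenchel-Young inequality it is always nonnegative, vanishing exactly when the regularized prediction matches the target.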
MSC:
68T05 Learning and adaptive systems in artificial intelligence
Software:
Scikit; UDPipe