
Robust discriminative clustering with sparse regularizers. (English) Zbl 1440.62243

Summary: Clustering high-dimensional data often requires some form of dimensionality reduction, where clustered variables are separated from noise-looking variables. We cast this problem as finding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we first show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings of the form \(d=O(\sqrt{n})\) for the affine-invariant case and \(d=O(n)\) for the sparse case, where \(n\) is the number of examples and \(d\) the ambient dimension; and finally, (e) we propose an efficient iterative algorithm with running-time complexity \(O(nd^2)\), improving on earlier algorithms for discriminative clustering with the square loss, which had quadratic complexity in the number of examples.
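
The two-cluster formulation sketched in the summary couples a least-squares fit of a one-dimensional projection with a binary labelling of the points. The following is a minimal illustrative sketch of that coupling as a simple alternating heuristic in NumPy; it is not the convex relaxation or the algorithm analyzed in the paper, and the function name, the random initialization, the small ridge term, and the sign-thresholding step are assumptions made only for this example.

import numpy as np

def two_cluster_projection(X, n_iter=50, seed=0):
    # Hypothetical helper (not the paper's method): alternately fit a direction w
    # by least squares and relabel the points by the sign of the projection X w,
    # loosely targeting min_{y in {-1,1}^n, w} ||Pi_n y - X w||_2^2,
    # where Pi_n = I - (1/n) 1 1^T centers the labels.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                  # center the data once
    y = rng.choice([-1.0, 1.0], size=n)      # random initial labels (assumption)
    w = np.zeros(d)
    for _ in range(n_iter):
        yc = y - y.mean()                    # centered labels Pi_n y
        # Given y: w = (Xc^T Xc)^{-1} Xc^T (Pi_n y); the 1e-8 ridge is only
        # for numerical stability and is not part of the original formulation.
        w = np.linalg.solve(Xc.T @ Xc + 1e-8 * np.eye(d), Xc.T @ yc)
        y_new = np.sign(Xc @ w)              # given w: threshold the projection
        y_new[y_new == 0] = 1.0
        if np.array_equal(y_new, y):
            break
        y = y_new
    return y, w

# Each iteration costs O(n d^2 + d^3), consistent with the O(n d^2) scaling quoted
# above when d <= n; the paper's algorithm addresses the convex relaxation rather
# than this non-convex alternation.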

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G35 Nonparametric robustness
Full Text: arXiv Link

References:

[1] J.-B. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In Computer Vision and Pattern Recognition (CVPR), 2016.
[2] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007. · Zbl 1302.68273
[3] F. Bach and Z. Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In Advances in Neural Information Processing Systems (NIPS), 2007.
[4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012. · Zbl 06064248
[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009. · Zbl 1175.94009
[6] R. Bellman. A note on cluster analysis and dynamic programming. Mathematical Biosciences, 1973. · Zbl 0277.49010
[7] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. The Journal of Machine Learning Research, 7:247-282, 2006. · Zbl 1222.62009
[8] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Finding actors and actions in movies. In Proc. IEEE International Conference on Computer Vision, 2013.
[9] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization, volume 3 of CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer-Verlag, 2000. Theory and examples. · Zbl 0953.90001
[10] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab Toolbox for Optimization on Manifolds. Journal of Machine Learning Research, 2014. · Zbl 1319.90003
[11] J. Bourgain, V. H. Vu, and P. M. Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559-603, 2010. · Zbl 1186.60003
[12] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004. · Zbl 1058.90049
[13] T. De Bie and N. Cristianini. Convex methods for transduction. In Advances in Neural Information Processing Systems (NIPS), pages 73-80, 2003.
[14] F. De la Torre and T. Kanade. Discriminative cluster analysis. In Proceedings of the conference on machine learning (ICML), 2006.
[15] E. Diederichs, A. Juditsky, A. Nemirovski, and V. Spokoiny. Sparse non-Gaussian component analysis by semidefinite programming. Machine Learning, 91(2):211-238, 2013. · Zbl 1273.68297
[16] C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and K-means clustering. In Proceedings of the conference on machine learning (ICML), 2007.
[17] D. Freedman. Statistical models: theory and practice. Cambridge University Press, 2009. · Zbl 1167.62001
[18] J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817-823, 1981.
[19] A. Frieze and M. Jerrum. Improved approximation algorithms for MAX k-CUT and MAX BISECTION. In Integer Programming and Combinatorial Optimization. Springer, 1995. · Zbl 1135.90420
[20] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoret. Comput. Sci., 1(3):237-267, 1976. · Zbl 0338.05120
[21] M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. J. ACM, 42(6):1115-1145, November 1995. · Zbl 0885.68088
[22] J. C. Gower and G. J. S. Ross. Minimum spanning trees and single Linkage cluster analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 18(1), 1969.
[23] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95-110. Springer-Verlag Limited, 2008. · Zbl 1205.90223
[24] M. Grant and S. Boyd. CVX: Matlab Software for Disciplined Convex Programming, version 2.1, March 2014.
[25] E. Grave. A convex relaxation for weakly supervised relation extraction. In EMNLP, 2014.
[26] G. Huang, J. Zhang, S. Song, and Z. Chen. Maximin separation probability clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
[27] A. Hyvärinen, J. Karhunen, and E. Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004.
[28] A. Joulin and F. Bach. A convex relaxation for weakly supervised classifiers. In Proceedings of the conference on machine learning (ICML), 2012.
[29] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proc. CVPR, 2010a.
[30] A. Joulin, J. Ponce, and F. Bach. Efficient optimization for discriminative latent class models. In Advances in Neural Information Processing Systems (NIPS), 2010b.
[31] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 2010. · Zbl 1215.65108
[32] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In International Symposium on Theory of Computing (STOC), pages 553-562, 2010. · Zbl 1293.68229
[33] R. M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85-103. Plenum, New York, 1972. · Zbl 1467.68065
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[35] R. Lajugie, P. Bojanowski, P. Cuvillier, S. Arlot, and F. Bach. A weakly-supervised discriminative model for audio-to-score alignment. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016.
[36] N. Le Roux and F. Bach. Local component analysis. In Proceedings of the International Conference on Learning Representations, 2013.
[37] Y.-F. Li, I. W. Tsang, J. T.-Y. Kwok, and Z.-H. Zhou. Tighter and convex maximum margin clustering. In AISTATS, pages 344-351, 2009.
[38] Z. Q. Luo, W. K. Ma, A. C. So, Y. Ye, and S. Zhang. Semidefinite relaxation of quadratic optimization problems. Signal Processing Magazine, IEEE, 2010.
[39] J. B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. University of California Press, 1967. · Zbl 0214.46201
[40] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Annual Symposium on Foundations of Computer Science (FOCS), pages 93-102, 2010.
[41] Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Math. Program., 2007. · Zbl 1126.90058
[42] A. Y. Ng, M. I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS). 2002.
[43] G. Niu, B. Dai, L. Shang, and M. Sugiyama. Maximum volume clustering: a new discriminative clustering approach. Journal of Machine Learning Research, 14(1):2641-2687, 2013. · Zbl 1318.62213
[44] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Annual Symposium on Foundations of Computer Science (FOCS), pages 165-176, 2006. · Zbl 1281.68229
[45] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1), 1966.
[46] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 1996. · Zbl 0850.62538
[47] J. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 2012. · Zbl 1259.60008
[48] J. von Neumann. Thermodynamik quantenmechanischer Gesamtheiten. Gött. Nachr., (1):273-291, 1927.
[49] F. Wang, B. Zhao, and C. Zhang. Linear time maximum margin clustering. IEEE Transactions on Neural Networks, 2010.
[50] H. Wang, F. Nie, and H. Huang. Multi-View Clustering and Feature Learning via Structured Sparsity. In Proceedings of the conference on machine learning (ICML), volume 28, 2013.
[51] Z. Wen, D. Goldfarb, and K. Scheinberg. Block coordinate descent methods for semidefinite programming. In Handbook on Semidefinite, Conic and Polynomial Optimization. Springer, 2012. · Zbl 1334.90118
[52] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems (NIPS), 2004.
[53] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In Advances in Neural Information Processing Systems (NIPS), 2008.
[54] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. IEEE Transactions on Neural Networks, 2009.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.