zbMATH — the first resource for mathematics

Weighted SGD for \(\ell_p\) regression with randomized preconditioning. (English) Zbl 1448.68381
Summary: In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems, e.g., \(\ell_2\) and \(\ell_1\) regression problems.
We propose a hybrid algorithm named PWSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system.
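To make the two-stage structure concrete, here is a minimal numerical sketch of the PWSGD idea for unconstrained \(\ell_2\) regression. This is an illustration only, not the paper's exact algorithm: the sketch size, step size, and iteration count are placeholder choices.

```python
import numpy as np

def pwsgd_l2(A, b, T=3000, sketch_size=None, seed=0):
    """Illustrative sketch of the PWSGD idea for unconstrained least
    squares: precondition with a Gaussian sketch, build an importance
    sampling distribution from the row norms of the preconditioned
    basis, then run weighted SGD on the preconditioned system."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    s = sketch_size or 8 * d
    # RLA preconditioning: R comes from a QR factorization of the
    # small sketch S A, so that U = A R^{-1} is well conditioned.
    S = rng.standard_normal((s, n)) / np.sqrt(s)
    _, R = np.linalg.qr(S @ A)
    U = A @ np.linalg.inv(R)
    # Importance sampling distribution from the row norms of U.
    row_norms = np.sum(U * U, axis=1)
    p = row_norms / row_norms.sum()
    # Weighted SGD on the preconditioned system h(y) = ||U y - b||_2^2.
    eta = 0.25 / row_norms.sum()          # conservative step size
    y = np.zeros(d)
    for _ in range(T):
        i = rng.choice(n, p=p)
        y -= eta * 2.0 * (U[i] @ y - b[i]) * U[i] / p[i]
    return np.linalg.solve(R, y)          # map back: x = R^{-1} y
```

On a consistent system the weighted iteration contracts linearly, so the returned solution essentially matches the least-squares solution.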
By rewriting a deterministic \(\ell_p\) regression problem as a stochastic optimization problem, we connect PWSGD to several existing \(\ell_p\) solvers including RLA methods with algorithmic leveraging (RLA for short).
We prove that PWSGD inherits faster convergence rates that depend only on the lower dimension of the linear system, while maintaining low computational complexity. These SGD convergence rates are superior to those of related SGD algorithms such as the weighted randomized Kaczmarz algorithm.
In particular, when solving \(\ell_1\) regression of size \(n\) by \(d\), PWSGD returns an approximate solution with \(\epsilon\) relative error in the objective value in \(\mathcal{O}(\log n \cdot \mathrm{nnz}(A) + \mathrm{poly}(d)/\epsilon^2)\) time. This complexity is uniformly better than that of RLA methods in terms of both \(\epsilon\) and \(d\) when the problem is unconstrained. In the presence of constraints, PWSGD only has to solve a sequence of much simpler and smaller optimization problems over the same constraints, which in general is more efficient than solving the constrained subproblem required by RLA.
For \(\ell_2\) regression, PWSGD returns an approximate solution with \(\epsilon\) relative error in the objective value and in the solution vector measured in prediction norm in \(\mathcal{O}(\log n \cdot \mathrm{nnz}(A) + \mathrm{poly}(d) \log(1/\epsilon) /\epsilon)\) time. We show that for unconstrained \(\ell_2\) regression, this complexity is comparable to that of RLA and is asymptotically better than several state-of-the-art solvers in the regime where the desired accuracy \(\epsilon\), the high dimension \(n\), and the low dimension \(d\) satisfy \(d\geq 1/\epsilon\) and \(n \geq d^2/\epsilon\).
We also provide lower bounds on the coreset complexity for more general regression problems, indicating that new ideas will still be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of these algorithms is illustrated numerically on both synthetic and real datasets; the results are consistent with our theoretical findings and demonstrate that PWSGD converges to a medium-precision solution, e.g., \(\epsilon=10^{-3}\), more quickly.
68T05 Learning and adaptive systems in artificial intelligence
68Q25 Analysis of algorithms and problem complexity
68W20 Randomized algorithms
90C15 Stochastic programming
[1] H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging LAPACK’s least-squares solver. SIAM J. on Scientific Computing, 32(3):1217–1236, 2010. · Zbl 1213.65069
[2] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, 1994. · Zbl 0814.65030
[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Computational Statistics (COMPSTAT), 2010.
[4] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Neural Information Processing Systems (NIPS), 2008.
[5] L. Bottou and Y. Le Cun. Large scale online learning. In Neural Information Processing Systems (NIPS), 2004.
[6] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM J. on Optimization, 26(2):1008–1031, 2016. · Zbl 1382.65166
[7] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001. · Zbl 0979.94010
[8] K. L. Clarkson. Subgradient and sampling algorithms for \(\ell_1\) regression. In Symposium on Discrete Algorithms (SODA), 2005.
[9] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing (STOC), 2013. · Zbl 1293.65069
[10] K. L. Clarkson, D. Eppstein, G. L. Miller, C. Sturtivant, and S. Teng. Approximating center points with iterated radon points. In Symposium on Computational Geometry, 1993. · Zbl 0859.68114
[11] K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff. The Fast Cauchy Transform and faster robust linear regression. In Symposium on Discrete Algorithms (SODA), 2013. · Zbl 1342.68352
[12] M. B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Symposium on Discrete Algorithms (SODA), 2016. · Zbl 1415.65106
[13] M. B. Cohen and R. Peng. \(\ell_p\) row sampling by Lewis weights. In Symposium on the Theory of Computing (STOC), 2015.
[14] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. Uniform sampling for matrix approximation. In Conference on Innovations in Theoretical Computer Science (ITCS), 2015a. · Zbl 1365.65108
[15] M. B. Cohen, C. Musco, and C. Musco. Ridge leverage scores for low-rank approximation. CoRR, abs/1511.07263, 2015b. · Zbl 1410.68399
[16] F. Curtis. A self-correcting variable-metric algorithm for stochastic optimization. In International Conference on Machine Learning (ICML), 2016.
[17] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for \(\ell_p\) regression. SIAM J. on Computing, 38(5):2060–2078, 2009. · Zbl 1191.68851
[18] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numer. Math., 117(2):219–249, 2011. · Zbl 1218.65037
[19] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Machine Learning Research, 13:3441–3472, 2012. · Zbl 1437.65030
[20] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Conference on Learning Theory (COLT), 2010.
[21] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research, 12:2121–2159, 2011. · Zbl 1280.68164
[22] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Symposium on Theory of Computing (STOC), 2011. · Zbl 1288.90046
[23] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1996. · Zbl 0865.65009
[24] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015. · Zbl 1319.68003
[25] C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Neural Information Processing Systems (NIPS), 2009.
[26] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems (NIPS), 2013.
[27] C. T. Kelley. Iterative Methods for Solving Linear and Nonlinear Equations. SIAM, Philadelphia, 1995. · Zbl 0832.65046
[28] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning (ICML), 2014. · Zbl 1337.62164
[29] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011.
[30] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on the Theory of Computing (STOC), 2013a. · Zbl 1293.68150
[31] X. Meng and M. W. Mahoney. Robust regression on MapReduce. In International Conference on Machine Learning (ICML), 2013b.
[32] X. Meng, M. A. Saunders, and M. W. Mahoney. LSRN: A parallel iterative solver for strongly over- or under-determined systems. SIAM J. on Scientific Computing, 36(2):C95–C118, 2014. · Zbl 1298.65053
[33] P. Moritz, R. Nishihara, and M. I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
[34] D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Neural Information Processing Systems (NIPS), 2014. · Zbl 1333.65070
[35] J. Nelson and H. L. Nguyen. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In Symposium on Foundations of Computer Science (FOCS), 2013.
[36] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems (NIPS), 2011.
[37] M. Pilanci and M. J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. ArXiv e-prints, 2014. · Zbl 1360.62400
[38] S. Portnoy. On computation of regression quantiles: Making the Laplacian tortoise faster. Lecture Notes-Monograph Series, Vol. 31, L1-Statistical Procedures and Related Topics, pages 187–200, 1997. · Zbl 1089.62512
[39] S. Portnoy and R. Koenker. The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators, with discussion. Statistical Science, 12(4):279–300, 1997. · Zbl 0955.62608
[40] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning (ICML), 2012.
[41] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, 2003. · Zbl 1031.65046
[42] N. Sauer. On the density of families of sets. J. Combinatorial Theory, Series A, 13(1):145–147, 1972. · Zbl 0248.05005
[43] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In International Conference on Machine Learning (ICML), 2008.
[44] S. Shalev-Shwartz and A. Tewari. Stochastic methods for \(\ell_1\)-regularized loss minimization. In International Conference on Machine Learning (ICML), 2009. · Zbl 1280.62081
[45] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In International Conference on Machine Learning (ICML), 2007. · Zbl 1211.90239
[46] C. Sohler and D. P. Woodruff. Subspace embedding for the \(\ell_1\)-norm with applications. In Symposium on Theory of Computing (STOC), 2011. · Zbl 1288.68276
[47] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2), 2009. · Zbl 1169.68052
[48] J. A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 3(1-2):115–126, 2011. · Zbl 1232.15029
[49] A. van der Sluis. Condition numbers and equilibration of matrices. Numerische Mathematik, 14(1): 14–23, 1969. · Zbl 0182.48906
[50] K. Varadarajan and X. Xiao. On the sensitivity of shape fitting problems. In Foundations of Software Technology and Theoretical Computer Science (FSTTCS), 2012. · Zbl 1359.62266
[51] D. P. Woodruff and Q. Zhang. Subspace embeddings and \(\ell_p\)-regression using exponential random variables. In Conference on Learning Theory (COLT), 2013.
[52] J. Yang, X. Meng, and M. W. Mahoney. Quantile regression for large-scale applications. SIAM J. Scientific Computing, 36(5):S78–S110, 2014. · Zbl 1314.68391
[53] J. Yang, Y. Chow, C. Ré, and M. W. Mahoney. Weighted SGD for \(\ell_p\) regression with randomized preconditioning. In Symposium on Discrete Algorithms (SODA), 2016a. · Zbl 1411.68129
[54] J. Yang, X. Meng, and M. W. Mahoney. Implementing randomized matrix algorithms in parallel and distributed environments. Proceedings of the IEEE, 104(1):58–92, 2016b.
[55] P. Zhao and T. Zhang. Stochastic optimization with importance sampling. In International Conference on Machine Learning (ICML), 2015.
Appendix A. Supplementary Details of Algorithm 1
As discussed, we need to compute a well-conditioned basis implicitly and estimate its row norms, i.e., \(AR^{-1}\) and \(\{\lambda_i\}_{i=1}^n\) in Steps 3 and 4 of Algorithm 1. In Section 2.1 we summarized the major steps for computing the preconditioner using sketching. Below, Tables 5 and 6 give a short summary of preconditioning methods based on various sketches, along with the resulting running time and condition number. The running time here denotes the total time for computing the matrix \(R\), i.e., the sketching time plus the time for a QR factorization of the sketch. As before, \(\bar{\kappa}_p(U)\) is the condition number of \(U = AR^{-1}\) as defined in Definition 1, and \(\kappa(U)\) is the standard condition number of \(U\).

Table 5: Summary of running time and condition number for several \(\ell_1\) conditioning methods. The failure probability of each method is set to a constant.
- Dense Cauchy Transform (Sohler and Woodruff, 2011): running time \(\mathcal{O}(nd^2\log d + d^3\log d)\), \(\bar{\kappa}_1(U) = \mathcal{O}(d^{5/2}\log^{3/2} d)\)
- Fast Cauchy Transform (Clarkson et al., 2013): \(\mathcal{O}(nd\log d + d^3\log d)\), \(\mathcal{O}(d^{11/2}\log^{9/2} d)\)
- Sparse Cauchy Transform (Meng and Mahoney, 2013a): \(\mathcal{O}(\mathrm{nnz}(A) + d^7\log^5 d)\), \(\mathcal{O}(d^{13/2}\log^{11/2} d)\)
- Reciprocal Exponential Transform (Woodruff and Zhang, 2013): \(\mathcal{O}(\mathrm{nnz}(A) + d^3\log d)\), \(\mathcal{O}(d^{7/2}\log^{5/2} d)\)
- Lewis Weights (Cohen and Peng, 2015): \(\mathcal{O}(\mathrm{nnz}(A)\log n + d^3\log d)\), \(\mathcal{O}(d^{3/2}\log^{1/2} d)\)

Table 6: Summary of running time and condition number for several \(\ell_2\) conditioning methods. Here we assume \(d \le n \le e^d\); the failure probability of each method is set to a constant.
- Gaussian Transform: running time \(\mathcal{O}(nd^2)\), \(\kappa_2(U) = \mathcal{O}(1)\), \(\bar{\kappa}_2(U) = \mathcal{O}(\sqrt{d})\)
- SRHT (Tropp, 2011): \(\mathcal{O}(nd\log n + d^3\log n\log d)\), \(\mathcal{O}(1)\), \(\mathcal{O}(\sqrt{d})\)
- Sparse \(\ell_2\) Embedding (Clarkson and Woodruff, 2013): \(\mathcal{O}(\mathrm{nnz}(A) + d^4)\), \(\mathcal{O}(1)\), \(\mathcal{O}(\sqrt{d})\)
- Sparse \(\ell_2\) Embedding (Cohen, 2016): \(\mathcal{O}(\mathrm{nnz}(A)\log d + d^3\log d)\), \(\mathcal{O}(1)\), \(\mathcal{O}(\sqrt{d})\)
- Refinement Sampling (Cohen et al., 2015a): \(\mathcal{O}(\mathrm{nnz}(A)\log(n/d)\log d + d^3\log(n/d)\log d)\), \(\mathcal{O}(1)\), \(\mathcal{O}(d)\)

Next, given the implicit representation of \(U\) by \(R\), to compute the leverage scores \(\|U_i\|_p^p\) exactly,
one has to form \(U\) explicitly, which takes \(\mathcal{O}(nd^2)\) time. Instead of forming \(U\) and "reading off" the row norms, one can estimate the row norms of \(U\) up to a small factor by post-multiplying by a random projection matrix; see Clarkson et al. (2013) and Drineas et al. (2012) for the cases \(p = 1\) and \(p = 2\), respectively. This process can be done in \(\mathcal{O}(\mathrm{nnz}(A)\cdot\log n)\) time.
Finally, we present two additional results regarding the non-asymptotic convergence rate of PWSGD on \(\ell_1\) and \(\ell_2\) regression, respectively. The notation is the same as that used in Proposition 5 and Proposition 6.
Proposition 14 For \(A \in \mathbb{R}^{n\times d}\) and \(b \in \mathbb{R}^n\), define \(f(x) = \|Ax - b\|_1\). Algorithm 1 with \(p = 1\) returns a solution vector estimate \(\bar{x}\) that satisfies the following expected error bound:
\[ \mathbb{E}\left[f(\bar{x})\right] - f(x^*) \le \frac{\|x^* - x_0\|_H^2}{2\eta (T+1)} + \frac{\eta}{2}\left(c_1 \alpha \|RF\|_1\right)^2. \tag{13} \]
Here, the expectation is taken over all the samples \(\xi_1,\dots,\xi_T\), and \(x^*\) is an optimal solution to the problem \(\min_{x\in\mathcal{Z}} f(x)\). The constant in the error bound is given by \(c_1 = \frac{1+\gamma}{1-\gamma}\).
Footnote 9 (to Table 6): In Cohen (2016), the author analyzes a more general version of the original count-sketch-like sparse \(\ell_2\) embedding (Clarkson and Woodruff, 2013); by setting the sparsity parameter differently, different running time complexities can be achieved.
Proposition 15 For \(A \in \mathbb{R}^{n\times d}\) and \(b \in \mathbb{R}^n\), define \(f(x) = \|Ax - b\|_2\). Algorithm 1 with \(p = 2\) returns a solution vector estimate \(x_T\) that satisfies the following expected error bound:
\[ \mathbb{E}\|x_T - x^*\|_H^2 \le \left(1 - \frac{4\eta}{\beta^2\|(RF)^{-1}\|_2^2}\left(1 - 2\eta c_1 \alpha^2 \|RF\|_2^2\right)\right)^T \|x_0 - x^*\|_H^2 + \frac{2 c_1 \eta\, \bar{\kappa}_2^2(U)\, \kappa^2(RF)\, h(y^*)}{1 - 2\eta c_1 \alpha^2 \|RF\|_2^2}. \tag{14} \]
Here, \(H = (FF^\top)^{-1}\) is the weight matrix of the ellipsoidal norm, the expectation is taken over all the samples \(\xi_1,\dots,\xi_T\), and \(x^*\) is an optimal solution to the problem \(\min_{x\in\mathcal{Z}} f(x)\). The constant in the error bound is given by \(c_1 = \frac{1+\gamma}{1-\gamma}\).
Appendix B. RLA Methods with Algorithmic Leveraging
In this section, we present the RLA sampling algorithms with algorithmic leveraging for solving the \(\ell_p\) regression problems mentioned in Section 3. The main idea in this class of algorithms is to sample rows based on the leverage scores of \(\bar{A} = [\,A \;\ b\,]\) and solve the sample-average approximation of the \(\ell_p\) regression problem. This method is formally stated in Algorithm 3.
The following theorem (from Dasgupta et al. (2009)) states that if the sampling size \(s\) is large enough, the resulting approximate solution \(\hat{x}\) produces a \(\frac{1+\epsilon}{1-\epsilon}\)-approximation of the objective value. It also shows that when the desired accuracy and confidence level are fixed, the required sampling size depends only on the lower dimension \(d\), since \(\alpha\) and \(\beta\) are independent of \(n\).
Theorem 16 Given an input matrix \(A \in \mathbb{R}^{n\times d}\) and a vector \(b \in \mathbb{R}^n\), let \(\alpha, \beta\) be the condition numbers of the well-conditioned basis \(U\) for the range space of \(\bar{A} = [\,A \;\ b\,]\), and let \(\gamma\) be the quality of approximation to the leverage scores satisfying (6). Then when \(\epsilon < 1/2\) and the sampling size satisfies the condition
\[ s \ge \frac{1+\gamma}{1-\gamma}\cdot\frac{(32\alpha\beta)^p}{p^2\epsilon^2}\left((d+1)\log\frac{12}{\epsilon} + \log\frac{2}{\delta}\right), \tag{15} \]
Algorithm 3 returns a solution vector \(\hat{x}\) that satisfies the following inequality with probability at least \(1-\delta\):
\[ \|A\hat{x} - b\|_p \le \frac{1+\epsilon}{1-\epsilon}\,\|Ax^* - b\|_p, \tag{16} \]
where \(x^* \in \mathcal{Z}\) is an optimal solution to the original problem \(\min_{x\in\mathcal{Z}} \|Ax - b\|_p\).
Remark. Compared to the RLA algorithm described in Section 3, the algorithm described here computes the leverage scores based on a basis for the range space of the augmented linear system \(\bar{A} = [\,A \;\ b\,]\) rather than of \(A\) alone. One can show similar results if the basis is computed for the range space of \(A\).
Remark. As can be seen, the sampling size is \(s = \mathcal{O}(\mathrm{poly}(d)\log(1/\epsilon)/\epsilon^2)\) for a target accuracy \(\epsilon\). For unconstrained \(\ell_2\) regression, however, it can be shown that a sampling size \(s = \mathcal{O}(\mathrm{poly}(d)\log(1/\epsilon)/\epsilon)\) is sufficient to compute an \(\epsilon\)-approximate solution; see Drineas et al. (2011) and Clarkson and Woodruff (2013) for details.
Appendix C. Proofs
Here, we present the proofs of the theoretical results in the main text.
Algorithm 3 RLA methods with algorithmic leveraging for constrained \(\ell_p\) regression
1: Input: \(A \in \mathbb{R}^{n\times d}\), \(b \in \mathbb{R}^n\) with \(\mathrm{rank}(\bar{A}) = k\) where \(\bar{A} = [\,A \;\ b\,]\), \(\mathcal{Z}\), and \(s > 0\).
2: Output: An approximate solution \(\hat{x} \in \mathbb{R}^d\) to the problem \(\min_{x\in\mathcal{Z}} \|Ax - b\|_p^p\).
3: Compute \(R \in \mathbb{R}^{k\times(d+1)}\) such that \(\bar{A} = UR\) and \(U\) is an \((\alpha,\beta)\) well-conditioned basis for the range space of \(\bar{A}\).
4: Compute or estimate \(\|U_i\|_p^p\) by \(\lambda_i\) satisfying (6) with \(\gamma\), for \(i \in [n]\).
5: Let \(p_i = \lambda_i / \sum_{j=1}^n \lambda_j\), for \(i \in [n]\).
6: Let \(S \in \mathbb{R}^{s\times n}\) be a zero matrix.
7: for \(t = 1, \dots, s\) do
8:   Pick \(\xi_t\) from \([n]\) based on the distribution \(\{p_i\}_{i=1}^n\).
9:   Set \(S_{t,\xi_t} = \left(1/p_{\xi_t}\right)^{1/p}\).
10: end for
11: Return \(\hat{x} = \arg\min_{x\in\mathcal{Z}} \|SAx - Sb\|_p\).
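For concreteness, here is a minimal sketch of the sampling-and-solve pattern of Algorithm 3, specialized to unconstrained \(\ell_2\) regression with exact leverage scores. This is a hypothetical illustration; the rescaling below includes an extra constant factor \(1/\sqrt{s}\), which does not affect the minimizer.

```python
import numpy as np

def rla_leveraging_l2(A, b, s, seed=0):
    """Sample s rows with probabilities proportional to the leverage
    scores of [A b], rescale them, and solve the small subproblem."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    Abar = np.column_stack([A, b])
    U, _ = np.linalg.qr(Abar)            # orthonormal basis of range([A b])
    lam = np.sum(U * U, axis=1)          # exact leverage scores
    p = lam / lam.sum()
    idx = rng.choice(n, size=s, p=p)
    w = 1.0 / np.sqrt(s * p[idx])        # row rescaling (1/p_i)^{1/2}
    SA = A[idx] * w[:, None]
    Sb = b[idx] * w
    x_hat, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_hat
```

The subproblem has only \(s\) rows, so its cost depends on \(s\) and \(d\) but not on \(n\), which is the point of the sampling-size bound in Theorem 16.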
C.1 Proof of Proposition 7
Consider the following three events:
- \(E_1\): Compute a matrix \(R\) such that \(U = AR^{-1}\) has condition number \(\bar{\kappa}_p\), and then compute \(F = R^{-1}\) and \(H = (FF^\top)^{-1}\).
- \(E_2\): Given \(R^{-1}\), compute \(\{\lambda_i\}_{i=1}^n\) as estimates of the row norms of \(AR^{-1}\) satisfying (6) with \(\gamma = 0.5\).
- \(E_3\): For a given basis \(U\) with condition number \(\bar{\kappa}_p(U)\) and \(\{\lambda_i\}_{i=1}^n\) with approximation quality \(\gamma\), PWSGD returns a solution with the desired accuracy within \(10T\) iterations, where \(T\) is specified in Proposition 5 or Proposition 6.
Since each preconditioning method shown in Table 5 succeeds with constant probability, \(E_1\) holds with constant probability. Also, as described in Appendix A, \(E_2\) has a constant failure probability. Finally, by Markov's inequality, \(E_3\) holds with probability at least 0.9. As setting the failure probabilities of \(E_1\) and \(E_2\) to be arbitrarily small does not alter the results in big-O notation, we can ensure that \(E_1\cap E_2\cap E_3\) holds with constant probability.
Conditioned on \(E_1\cap E_2\cap E_3\), to converge to the desired solution, for \(\ell_1\) regression PWSGD runs in \(\mathcal{O}(d\,\bar{\kappa}_1^2(U)/\epsilon^2)\) iterations. Since all the preconditioning methods in Table 6 provide \(\kappa(U) = \mathcal{O}(1)\) and \(\bar{\kappa}_2(U) = \mathcal{O}(\sqrt{d})\), for unconstrained \(\ell_2\) regression PWSGD runs in \(\mathcal{O}(d\log(1/\epsilon)/\epsilon)\) iterations. For constrained \(\ell_2\) regression, since an \(\epsilon\)-approximate solution in terms of the solution vector measured in the prediction norm implies a \(\sqrt{\epsilon}\)-approximate solution on the objective, PWSGD runs in \(\mathcal{O}(d\log(1/\epsilon)/\epsilon^2)\) iterations to return an \(\epsilon\)-approximate solution in the objective value.
The overall complexity is the sum of the complexities of the above events. For \(E_1\), it is \(\mathrm{time}(R)\), since the time for computing \(F\) and \(H\) is \(\mathcal{O}(d^3)\), which can be absorbed into \(\mathrm{time}(R)\), and they only have to be computed once. For \(E_2\), it is \(\mathcal{O}(\mathrm{nnz}(A)\cdot\log n)\). Finally, for \(E_3\), when the problem is unconstrained, \(\mathrm{time}_{\mathrm{update}} = \mathcal{O}(d^2)\); when the problem is constrained, \(\mathrm{time}_{\mathrm{update}} = \mathrm{poly}(d)\). Combining these, we obtain the complexities shown in the statement. This completes the proof.
C.2 Proof of Proposition 14
The proof of this proposition is structured as follows. First, we reformulate the problem using Proposition 4. Second, we show that the sequence of solution vector estimates \(\{x_t\}_{t=1}^T\) in Algorithm 1 is equivalent to the sequence of solution vector estimates \(\{y_t\}_{t=1}^T\) obtained by running SGD on the equivalent problem. Third, we analyze the convergence rate of \(\{y_t\}_{t=1}^T\) and conclude the error bound analysis.
Problem reformulation. Suppose \(U\) is an \(\ell_p\) well-conditioned basis for the range space of \(A\), and \(A = UR\) for some nonsingular matrix \(R\). Let \(P\) be the distribution defined from the estimates of the corresponding leverage scores. That is, for \(i \in [n]\),
\[ p_i = \frac{\lambda_i}{\sum_{j=1}^n \lambda_j}, \tag{17} \]
where \(\lambda_i\) is an estimate of \(\|U_i\|_p^p\) satisfying
\[ (1-\gamma)\|U_i\|_p^p \le \lambda_i \le (1+\gamma)\|U_i\|_p^p. \tag{18} \]
This implies
\[ \frac{1-\gamma}{1+\gamma}\cdot\frac{\|U_i\|_p^p}{\|U\|_p^p} \le p_i \le \frac{1+\gamma}{1-\gamma}\cdot\frac{\|U_i\|_p^p}{\|U\|_p^p}. \tag{19} \]
From Proposition 4, recall that for any nonsingular matrix \(F \in \mathbb{R}^{(d+1)\times(d+1)}\), the constrained \(\ell_p\) regression problem
\[ \min_{x\in\mathcal{Z}} f(x) := \|Ax - b\|_p^p \tag{20} \]
can be equivalently written as the following stochastic optimization problem:
\[ \min_{y\in\mathcal{Y}} h(y) = \|URFy - b\|_p^p = \mathbb{E}_{\xi\sim P}\left[|U_\xi RFy - b_\xi|^p / p_\xi\right]. \tag{21} \]
Notice that, compared to the objective function defined in (1), where \(f(x) = \|Ax - b\|_p\), we rewrite \(f(x)\) as a sum of subfunctions, i.e., \(f(x) = \|Ax - b\|_p^p\), so that SGD can be applied.
Equivalence of sequences. Via the following linear transformation, the sequence \(\{x_t\}_{t=1}^T\) obtained by (11) in Algorithm 1 is in one-to-one correspondence with the sequence \(\{y_t\}_{t=1}^T\) obtained by running SGD on problem (21):
\[ Fy_t = x_t, \qquad F\bar{y} = \bar{x}, \qquad Fy^* = x^*. \tag{22} \]
Under condition (22), the objective function values are immediately equivalent as well:
\[ h(y_t) = f(x_t), \qquad h(\bar{y}) = f(\bar{x}), \qquad h(y^*) = f(x^*), \tag{23} \]
where \(\bar{x} = \frac{1}{T}\sum_{t=1}^T x_t\), \(\bar{y} = \frac{1}{T}\sum_{t=1}^T y_t\), and \(x^*\) and \(y^*\) are optimal points of optimization problems (20) and (21), respectively.
Now we prove (22) by induction. By defining \(Fy_0 = x_0\), the equivalence condition holds in the base case (\(t = 0\)). By the induction hypothesis, assume (22) holds for \(t = k\). For \(t = k+1\), we show that \(x_{k+1}\) returned by Algorithm 1 and \(y_{k+1}\) returned by the SGD update rule satisfy (22).
For simplicity, assume that at the \(k\)-th iteration the \(i\)-th row is picked. For the subfunction \(h_k(y) = |U_iRFy - b_i|^p / p_i\), the (sub)gradient is
\[ g_k(y) = p\cdot\mathrm{sgn}(U_iRFy - b_i)\cdot(U_iRFy - b_i)^{p-1}\cdot U_iRF / p_i, \tag{24} \]
for which the SGD update rule becomes
\[ y_{k+1} = \arg\min_{y\in\mathcal{Y}}\; \eta\,\langle y - y_k,\, c_k\, U_iRF\rangle + \frac{1}{2}\|y_k - y\|_2^2, \tag{25} \]
where \(c_k = p\cdot\mathrm{sgn}(U_iRFy - b_i)\cdot(U_iRFy - b_i)^{p-1}/p_i\) is the corresponding (sub)gradient scalar. Recalling the linear transformation \(Fy_k = x_k\), the feasible set \(\mathcal{Y} = \{y \in \mathbb{R}^k \mid y = F^{-1}x,\; x\in\mathcal{Z}\}\), and the input matrix row \(A_i = U_iR\), the update rule (25) becomes
\[ x_{k+1} = \arg\min_{x\in\mathcal{Z}}\; \eta\, c_k A_i x + \frac{1}{2}\|F^{-1}(x_k - x)\|_2^2. \tag{26} \]
The equation above is exactly the update performed in (11). In particular, when \(\mathcal{Z} = \mathbb{R}^d\), i.e., in the unconstrained case, (26) has a closed-form solution as shown in (11). From the above analysis of the equivalence between (25) and (26), one sees that \(x_{k+1}\) and \(y_{k+1}\) satisfy the relationship defined in (22), i.e., the induction hypothesis holds at \(t = k+1\).
Therefore, by induction, condition (22), and hence condition (23), holds for any \(t\).
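The equivalence (22) can also be checked numerically. The snippet below is a hypothetical check for the smooth case \(p = 2\), with uniform sampling so that the \(1/p_i\) weight is a constant that does not affect the equivalence; it runs the SGD step in \(y\)-space and the corresponding closed-form step in \(x\)-space with the same samples, and verifies \(Fy_t = x_t\) throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
F = np.eye(d) + 0.1 * rng.standard_normal((d, d))  # any nonsingular F
eta = 0.01

x = rng.standard_normal(d)
y = np.linalg.solve(F, x)                 # base case: F y_0 = x_0
for _ in range(100):
    i = rng.integers(n)
    c = 2.0 * (A[i] @ x - b[i])           # c_k for p = 2; A_i x = U_i R F y
    y = y - eta * c * (F.T @ A[i])        # SGD step in y-space on ||A F y - b||^2
    x = x - eta * c * (F @ F.T @ A[i])    # closed-form unconstrained step in x-space
assert np.allclose(F @ y, x)
```

The key identity is \(F(y_k - \eta c_k F^\top A_i^\top) = x_k - \eta c_k (FF^\top)A_i^\top\), which is precisely the inductive step above.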
Convergence rate. By the equivalence condition in (23), it suffices to analyze the performance of the sequence \(\{y_t\}_{t=1}^T\). When \(p = 1\), the objective function is non-differentiable; substituting the subgradient of an \(\ell_1\) objective function into the update (25), the SA method reduces to stochastic subgradient descent. We now analyze the convergence rate of stochastic subgradient descent on problem (21) with \(p = 1\).
Suppose the \(i\)-th row is picked at the \(t\)-th iteration. Recall that the (sub)gradient of the sample objective \(|U_iRFy - b_i|/p_i\) in (25) is
\[ g_t(y) = \mathrm{sgn}(U_iRFy - b_i)\cdot U_iRF / p_i. \tag{27} \]
Hence, by inequality (19), the norm of \(g_t(y)\) is upper bounded as follows:
\[ \|g_t(y)\|_1 = \|U_iRF\cdot\mathrm{sgn}(U_iRFy - b_i)\|_1 / p_i \le \|RF\|_1\,\|U_i\|_1\cdot\frac{1+\gamma}{1-\gamma}\cdot\frac{\|U\|_1}{\|U_i\|_1} \le \frac{1+\gamma}{1-\gamma}\,\alpha\,\|RF\|_1. \tag{28} \]
Above, we used the properties of the well-conditioned basis \(U\). Furthermore, by Proposition 17 and the equivalence condition in (23), for \(H = (FF^\top)^{-1}\) we have
\[ \mathbb{E}[f(\bar{x})] - f(x^*) = \mathbb{E}[h(\bar{y})] - h(y^*) \le \frac{\|y^* - y_0\|_2^2}{2\eta(T+1)} + \frac{\eta}{2}\left(\alpha\|RF\|_1\,\frac{1+\gamma}{1-\gamma}\right)^2 \tag{29} \]
\[ = \frac{\|x^* - x_0\|_H^2}{2\eta(T+1)} + \frac{\eta}{2}\left(\alpha\|RF\|_1\,\frac{1+\gamma}{1-\gamma}\right)^2, \tag{30} \]
which completes the proof.
C.3 Proof of Proposition 5
By Proposition 17, when the step size equals
\[ \eta = \frac{\|y^* - y_0\|_2}{\alpha\|RF\|_1\sqrt{T+1}}\cdot\frac{1-\gamma}{1+\gamma}, \]
the expected error bound is given by
\[ \mathbb{E}[h(\bar{y})] - h(y^*) \le \frac{\alpha\|RF\|_1\,\|y^* - y_0\|_2}{\sqrt{T+1}}\cdot\frac{1+\gamma}{1-\gamma}. \tag{31} \]
By simple algebraic manipulations, we have
\[ \frac{1}{\sqrt{d}}\|y^*\|_2 \le \|y^*\|_\infty = \|(RF)^{-1}RFy^*\|_\infty \le \|(RF)^{-1}\|_1\,\|RFy^*\|_\infty \le \beta\,\|(RF)^{-1}\|_1\,\|URFy^*\|_1 = c_3\,\beta\,\|(RF)^{-1}\|_1\,h(y^*), \tag{32} \]
where \(c_3 = \|URFy^*\|_1 / h(y^*)\). Above, we used the properties of the well-conditioned basis \(U\). Furthermore, from inequality (31) and the equivalence condition in (23), the expected relative error can be upper bounded by
\[ \frac{\mathbb{E}[f(\bar{x})] - f(x^*)}{f(x^*)} = \frac{\mathbb{E}[h(\bar{y})] - h(y^*)}{h(y^*)} \le \frac{c_3\sqrt{d}\,\beta\|(RF)^{-1}\|_1}{\|y^*\|_2}\cdot\frac{\alpha\|RF\|_1\,\|y^* - y_0\|_2}{\sqrt{T+1}}\cdot\frac{1+\gamma}{1-\gamma} \le \frac{\|y^* - y_0\|_2}{\|y^*\|_2}\cdot\frac{c_3\sqrt{d}\,\alpha\beta}{\sqrt{T+1}}\cdot\|RF\|_1\|(RF)^{-1}\|_1\cdot\frac{1+\gamma}{1-\gamma}. \tag{33} \]
Since the right-hand side of the above inequality is a function of the stopping time \(T > 0\), for any given error threshold \(\epsilon > 0\), setting the right-hand side equal to \(\epsilon\) yields the following stopping condition:
\[ \frac{\sqrt{d}\,\alpha\beta\, c_1 c_3\sqrt{c_2}\,\|RF\|_1\|(RF)^{-1}\|_1}{\sqrt{T+1}} = \epsilon, \tag{34} \]
where the constants are given by
\[ c_1 = \frac{1+\gamma}{1-\gamma}, \qquad c_2 = \frac{\|x_0 - x^*\|_H^2}{\|x^*\|_H^2} = \frac{\|y_0 - y^*\|_2^2}{\|y^*\|_2^2}. \]
Rearranging the above terms, we see that after
\[ T = \frac{d\,\alpha^2\beta^2\, c_1^2 c_2 c_3^2}{\epsilon^2}\,\|RF\|_1^2\|(RF)^{-1}\|_1^2 \tag{35} \]
iterations, the relative expected error is upper bounded by \(\epsilon > 0\), i.e.,
\[ \frac{\mathbb{E}[f(\bar{x})] - f(x^*)}{f(x^*)} \le \epsilon. \tag{36} \]
This completes the proof.
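To make the dependence on \(d\) and \(\epsilon\) in the bound (35) concrete, here is a small hypothetical helper, written for the choice \(F = R^{-1}\), in which case \(RF = I\) and \(\|RF\|_1 = \|(RF)^{-1}\|_1 = 1\):

```python
def pwsgd_l1_iterations(d, alpha, beta, gamma, c2, c3, eps):
    """Iteration count from the bound (35) with F = R^{-1}, so that
    ||RF||_1 = ||(RF)^{-1}||_1 = 1. The count depends on the low
    dimension d and the accuracy eps, but not on n."""
    c1 = (1.0 + gamma) / (1.0 - gamma)
    return d * (alpha * beta) ** 2 * c1 ** 2 * c2 * c3 ** 2 / eps ** 2
```

Halving \(\epsilon\) quadruples the iteration count, reflecting the \(1/\epsilon^2\) dependence of the \(\ell_1\) rate.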
C.4 Proof of Proposition 15
Similar to the proof of Proposition 14, this proof is split into three parts: problem reformulation, equivalence of sequences, and convergence rate. From the proof of Proposition 14, note that the arguments for problem reformulation and equivalence of sequences hold for general \(p\), and thus in particular for \(p = 2\). We therefore proceed to the convergence rate. Again, by the equivalence condition, we can establish the convergence rate of the solution vector estimates \(\{x_t\}\) via the convergence rate achieved by the sequence \(\{y_t\}\), i.e., that of SGD on problem (21) for \(p = 2\). Throughout the rest of the proof, we denote
\[ f(x) = \|Ax - b\|_2^2, \qquad h(y) = \|AFy - b\|_2^2. \tag{37} \]
Denote by \(H = (FF^\top)^{-1}\) the weight matrix of the ellipsoidal norm. Also recall that when the leverage scores satisfy the error condition in (6), we have
\[ \frac{1-\gamma}{1+\gamma}\cdot\frac{\|U_i\|_2^2}{\|U\|_F^2} \le p_i \le \frac{1+\gamma}{1-\gamma}\cdot\frac{\|U_i\|_2^2}{\|U\|_F^2}. \tag{38} \]
Also, we assume that \(U\) is \((\alpha,\beta)\)-conditioned with \(\bar{\kappa}_2(U) = \alpha\beta\). Based on Definition 1, we have
\[ \alpha^2 = \|U\|_F^2, \tag{39} \]
\[ \beta^2 = \|(U^\top U)^{-1}\|_2, \tag{40} \]
and thus
\[ \bar{\kappa}_2^2(U) = \|(U^\top U)^{-1}\|_2\cdot\|U\|_F^2 = \alpha^2\beta^2. \tag{41} \]
Before deriving the convergence rate, we compute a few constants:
\[ \mu = 2\sigma_{\min}^2(AF) = \frac{2}{\|((URF)^\top URF)^{-1}\|_2} \ge \frac{2}{\|(U^\top U)^{-1}\|_2\cdot\|(RF)^{-1}\|_2^2} = \frac{2}{\beta^2\|(RF)^{-1}\|_2^2}, \tag{42} \]
\[ \sup_i L_i = \sup_i \frac{2\|U_iRF\|_2^2}{p_i} \le 2c_1\|U\|_F^2\cdot\|RF\|_2^2 = 2c_1\alpha^2\|RF\|_2^2, \tag{43} \]
and
\[ \sigma^2 = \mathbb{E}_{i\sim P}\|g_i(y^*)\|_2^2 = 4\sum_{i=1}^n (A_iFy^* - b_i)^2\|A_iF\|_2^2/p_i = 4\sum_{i=1}^n (U_iRFy^* - b_i)^2\|U_iRF\|_2^2/p_i \le 4c_1\|RF\|_2^2\|U\|_F^2\sum_{i=1}^n (U_iRFy^* - b_i)^2 = 4c_1\alpha^2\|RF\|_2^2\, h(y^*). \tag{44} \]
Equipped with these constants, from Proposition 18 we have the following error bound for the solution vector estimates generated by the weighted SGD algorithm:
\[ \mathbb{E}\|x_T - x^*\|_H^2 = \mathbb{E}\|y_T - y^*\|_2^2 \le \left(1 - 4\eta\sigma_{\min}^2(AF)\left(1 - \eta\sup_i \frac{2\|A_iF\|_2^2}{p_i}\right)\right)^T\|y_0 - y^*\|_2^2 + \frac{2\eta\sum_{i=1}^n (A_iFy^* - b_i)^2\|A_iF\|_2^2/p_i}{\sigma_{\min}^2(AF)\left(1 - \eta\sup_i \frac{2\|A_iF\|_2^2}{p_i}\right)} \le \left(1 - \frac{4\eta}{\beta^2\|(RF)^{-1}\|_2^2}\left(1 - 2\eta c_1\alpha^2\|RF\|_2^2\right)\right)^T\|x_0 - x^*\|_H^2 + \frac{2c_1\eta\,\bar{\kappa}_2^2(U)\,\kappa^2(RF)\,h(y^*)}{1 - 2\eta c_1\alpha^2\|RF\|_2^2}. \tag{45} \]
Notice that the equalities above follow from the equivalence condition in (23). Combining the results from the above parts completes the proof of this lemma.
C.5 Proof of Proposition 6
Throughout the proof, we denote
\[ f(x) = \|Ax - b\|_2^2, \qquad h(y) = \|AFy - b\|_2^2. \tag{46} \]
Denote by \(H = (FF^\top)^{-1}\) the weight matrix of the ellipsoidal norm. Also recall the following constants defined in the statement of the proposition:
\[ c_1 = \frac{1+\gamma}{1-\gamma}, \qquad c_2 = \frac{\|y_0 - y^*\|_2^2}{\|y^*\|_2^2} = \frac{\|x_0 - x^*\|_H^2}{\|x^*\|_H^2}, \qquad c_3 = \frac{\|Ax^*\|_2^2}{f(x^*)}. \tag{47} \]
Before diving into the detailed proof, we first note a useful inequality:
\[ c_3\, h(y^*) = c_3\, f(x^*) = \|Ax^*\|_2^2 = \|URFy^*\|_2^2 \ge \mu\|y^*\|_2^2/2. \tag{48} \]
Now we show the first part. For an arbitrary target error \(\epsilon > 0\), using (42), (43), and (44), and setting the target tolerance
\[ \bar\epsilon = \frac{\epsilon c_3\, h(y^*)}{\|AF\|_2^2} \tag{49} \]
in Corollary 19, we have that when the step size is set to
\[ \eta = \frac{\epsilon c_3\,\sigma_{\min}^2(AF)\, h(y^*)/\|AF\|_2^2}{4\sum_{i=1}^n (A_iFy^* - b_i)^2\|A_iF\|_2^2/p_i + \epsilon c_3\,\big(h(y^*)/\|AF\|_2^2\big)\,\sigma_{\min}^2(AF)\,\sup_i \|A_iF\|_2^2/p_i}, \tag{50} \]
then after
\[ T = \left(c_1\alpha^2\beta^2\|RF\|_2^2\|(RF)^{-1}\|_2^2 + \frac{c_1\alpha^2\beta^4\|U\|_2^2\|RF\|_2^4\|(RF)^{-1}\|_2^4}{\epsilon c_3}\right)\log\left(\frac{2\|U\|_2^2\|RF\|_2^2\,\|y_0 - y^*\|_2^2}{\epsilon c_3\, h(y^*)}\right) \le c_1\bar{\kappa}_2^2(U)\kappa^2(RF)\left(1 + \frac{\kappa^2(U)\kappa^2(RF)}{\epsilon c_3}\right)\log\left(\frac{2\|U\|_2^2\|RF\|_2^2\,\|y_0 - y^*\|_2^2}{\epsilon c_3\, h(y^*)}\right) \tag{51} \]
iterations, the sequence \(\{y_t\}_{t=1}^T\) generated by the weighted SGD algorithm satisfies the error bound
\[ \|y_T - y^*\|_2^2 \le \frac{\epsilon c_3\, h(y^*)}{\|AF\|_2^2}. \tag{52} \]
Notice that in (51) we used (48). From this, we have
\[ \|A(x_T - x^*)\|_2^2 = \|AFF^{-1}(x_T - x^*)\|_2^2 \le \|AF\|_2^2\cdot\|x_T - x^*\|_H^2 = \|AF\|_2^2\cdot\|y_T - y^*\|_2^2 \le \epsilon c_3\, h(y^*) = \epsilon\|Ax^*\|_2^2. \tag{53} \]
For the second part, we show the result for a general choice of \(F\). The proof is essentially the same as that of the first part, except that we set
\[ \bar\epsilon = \frac{\epsilon^2\, h(y^*)}{\|AF\|_2^2} \tag{54} \]
in Corollary 19. The resulting step size \(\eta\) and required number of iterations \(T\) become
\[ \eta = \frac{\epsilon^2\,\sigma_{\min}^2(AF)\, h(y^*)/\|AF\|_2^2}{4\sum_{i=1}^n (A_iFy^* - b_i)^2\|A_iF\|_2^2/p_i + \epsilon^2\,\big(h(y^*)/\|AF\|_2^2\big)\,\sigma_{\min}^2(AF)\,\sup_i \|A_iF\|_2^2/p_i} \tag{55} \]
and
\[ T = c_1\bar{\kappa}_2^2(U)\kappa^2(RF)\left(1 + \frac{\kappa^2(U)\kappa^2(RF)}{\epsilon^2}\right)\log\left(\frac{2\|U\|_2^2\|RF\|_2^2\,\|y_0 - y^*\|_2^2}{\epsilon^2\, h(y^*)}\right). \tag{56} \]
Setting \(F = R^{-1}\) recovers the value of \(T\) shown in Proposition 6. The sequence \(\{y_t\}_{t=1}^T\) generated by the weighted SGD algorithm satisfies the error bound
\[ \|y_T - y^*\|_2^2 \le \frac{\epsilon^2\, h(y^*)}{\|AF\|_2^2}. \tag{57} \]
Notice that when the problem is unconstrained, by smoothness of the objective \(h(y)\) we have
\[ h(y_T) - h(y^*) \le \|AF\|_2^2\cdot\|y_T - y^*\|_2^2 \le \epsilon^2\, h(y^*). \tag{58} \]
Then by (23), we have
\[ f(x_T) \le (1 + \epsilon^2)f(x^*) \le (1 + 2\epsilon + \epsilon^2)f(x^*). \tag{59} \]
This implies
\[ \sqrt{f(x_T)} \le (1+\epsilon)\sqrt{f(x^*)}. \tag{60} \]
This completes the proof, since \(\sqrt{f(x)} = \|Ax - b\|_2\).
C.6 Proof of Theorem 10
Let \(G_f\) consist of \(m_f\) copies of \(g_f\) and let \(G = \bigcup_{f\in F} G_f\). We may view the sampling step in Algorithm 2 as follows. Sample \(s\) items uniformly from \(G\) independently with replacement and denote the corresponding subset of samples by \(S\). Then rescale every function in \(S\) by \(M(F)/s\) to obtain \(D\). By Theorem 4.1 in Feldman and Langberg (2011), we know that if the above intermediate set \(S\) is an \((\epsilon\cdot n/M(F))\)-approximation of the set \(G\), then the resulting set \(D\) is a desired \(\epsilon\)-coreset for \(F\). Indeed, \(S\) is such a set according to Theorem 6.10 in Feldman and Langberg (2011).
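As a concrete illustration of the sample-and-rescale step above, the sketch below draws \(s\) items uniformly with replacement from the multiset \(G\) (item \(f\) appearing \(m_f\) times) and reweights each sample by \(M(F)/s\); the function and variable names are ours, not from the paper:

```python
import random

def sample_and_rescale(multiplicities, s, seed=0):
    """Sample s items uniformly with replacement from the multiset G
    (item f appears m_f times), then rescale each sample by M(F)/s."""
    rng = random.Random(seed)
    # Expand the multiset G: item f appears multiplicities[f] times.
    G = [f for f, m in enumerate(multiplicities) for _ in range(m)]
    M = len(G)                                  # M(F) = total multiplicity
    S = [rng.choice(G) for _ in range(s)]       # intermediate sample S
    D = [(f, M / s) for f in S]                 # rescaled coreset D
    return D

# Toy usage: three functions with multiplicities m_f = 1, 2, 3, so M(F) = 6.
D = sample_and_rescale([1, 2, 3], s=4)
```

Each element of `D` carries the common weight \(M(F)/s\), matching the rescaling described in the proof.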
C.7 Proof of Proposition 11

We use \(A\) to denote \(\bar A\) for the sake of simplicity. Also define the sensitivity at row index \(i \in [n]\) as
\[
s_i = n \cdot \sup_{x\in C} \frac{|A_i x|^p}{\sum_{j=1}^n |A_j x|^p}. \tag{61}
\]
Suppose \(U \in \mathbb{R}^{n\times k}\) is an \((\alpha,\beta)\) well-conditioned basis of the range space of \(A\) satisfying \(A = UR\), where \(k = \operatorname{rank}(A)\) and \(R \in \mathbb{R}^{k\times(d+1)}\). Then from (61), we have that
\[
\frac{s_i}{n} = \sup_{x\in C}\frac{|A_i x|^p}{\|Ax\|_p^p} = \sup_{x\in C}\frac{|U_i R x|^p}{\|URx\|_p^p} = \sup_{y\in C'}\frac{|U_i y|^p}{\|Uy\|_p^p} \le \sup_{y\in C'}\frac{\|U_i\|_p^p\,\|y\|_q^p}{\|y\|_q^p/\beta^p} = \beta^p\,\|U_i\|_p^p = \beta^p\,\lambda_i, \tag{62}
\]
where \(C' = \{y \in \mathbb{R}^k \mid y = Rx,\ x \in C\}\) and \(x \mapsto Rx\) is a one-to-one mapping. The first inequality follows from Hölder's inequality with \(1/p + 1/q = 1\) and the properties of well-conditioned bases. According to the definition of sensitivity, \(m(f_i) = \lfloor s_i \rfloor + 1\), so the above property implies
\[
m(f_i) \le n\,\beta^p\,\lambda_i + 1, \tag{63}
\]
which implies \(M(F) = \sum_{i=1}^n m(f_i) \le \big(n\,\beta^p \sum_{i=1}^n \lambda_i\big) + n = n\big((\alpha\beta)^p + 1\big)\), and completes the proof.
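For intuition, in the special case \(p = 2\) an orthonormal basis \(U\) (e.g., from a thin QR factorization) is well-conditioned with \(\beta = 1\), and the bound in (62) reduces to the leverage scores \(\|U_i\|_2^2\). A minimal numpy sketch (the helper name is ours):

```python
import numpy as np

def sensitivity_upper_bounds(A):
    """Upper-bound sup_x |A_i x|^2 / ||Ax||_2^2 by the leverage scores
    lambda_i = ||U_i||_2^2, where A = U R with U orthonormal (p = 2)."""
    U, _ = np.linalg.qr(A)            # thin QR: U has orthonormal columns
    return np.sum(U**2, axis=1)       # squared row norms of U

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
lam = sensitivity_upper_bounds(A)
```

The scores sum to \(\operatorname{rank}(A)\), and each score dominates the ratio \(|A_i x|^2/\|Ax\|_2^2\) for every \(x\), mirroring the chain of inequalities in (62).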
C.8 Proof of Proposition 12

According to Definition 9, we only have to show that for an arbitrary \(n\) and set of points \(G = \{a_1, \dots, a_n\} \subseteq \mathbb{R}^d\), the following condition holds:
\[
|\{\mathrm{Range}(G, x, r) \mid x \in \mathcal{X},\ r \ge 0\}| \le n^{d+1},
\]
where \(\mathrm{Range}(G, x, r) = \{a_i \mid |a_i^\top x|^p \le r\}\) is the region located inside the \(p\)-norm ellipsoid \(|a^\top x|^p = r\). Since \(\{a_i \mid |a_i^\top x|^p \le r\} = \{a_i \mid |a_i^\top x| \le r^{1/p}\}\) and the constant \(r\) is non-negative and arbitrary, without loss of generality we may assume \(p = 1\) in the above definition, i.e., \(\mathrm{Range}(G, x, r) = \{a_i \mid |a_i^\top x| \le r\}\).

Notice that for every \(x\) and \(r\), \(\mathrm{Range}(G, x, r)\) is a subset of \(G\). Hence, we may view it as a binary classifier on \(G\), denoted by \(c_{x,r}\). Given \(x \in \mathcal{X}\) and \(r \ge 0\), for any \(a_i \in G\) we have that
\[
c_{x,r}(a_i) = \begin{cases} 1, & \text{if } |a_i^\top x| \le r; \\ 0, & \text{otherwise.} \end{cases}
\]
Therefore, one immediately sees that \(|\{\mathrm{Range}(G, x, r) \mid x \in \mathcal{X},\ r \ge 0\}|\) is the shattering coefficient of \(\mathcal{C} := \{c_{x,r} \mid x \in \mathcal{X},\ r \ge 0\}\) on \(n\) points, denoted by \(s(\mathcal{C}, n)\). To bound the shattering coefficient of \(\mathcal{C}\), we provide an upper bound based on its VC dimension.

We claim that the VC dimension of \(\mathcal{C}\) is at most \(d + 1\). By contradiction, suppose there exist \(d + 2\) points such that any labeling on these \(d + 2\) points can be realized by \(\mathcal{C}\). By Radon's theorem (Clarkson et al., 1993), we can partition these points into two disjoint subsets, namely \(V\) and \(W\) with sizes \(n_1\) and \(n_2\) respectively, such that the intersection of their convex hulls is nonempty. Let \(b\) be a point located in the intersection of the convex hulls of \(V\) and \(W\), which can be written as
\[
b = \sum_{i=1}^{n_1} \lambda_i v_i = \sum_{i=1}^{n_2} \sigma_i w_i, \tag{64}
\]
where \(\lambda_i \ge 0\), \(\sigma_i \ge 0\) and \(\sum_{i=1}^{n_1} \lambda_i = \sum_{i=1}^{n_2} \sigma_i = 1\). By the shattering assumption, we can find a vector \(x \in \mathbb{R}^d\) and a nonnegative constant \(r\) such that the following conditions hold:
\[
-r \le x^\top v_i \le r, \quad i = 1, \dots, n_1; \tag{65}
\]
\[
x^\top w_i > r \ \text{ or } \ x^\top w_i < -r, \quad i = 1, \dots, n_2. \tag{66}
\]
By combining the conditions in (64), (65) and (66), we further obtain both inequalities
\[
-r \le b^\top x \le r \tag{67}
\]
and
\[
b^\top x < -r \ \text{ or } \ b^\top x > r, \tag{68}
\]
which is a contradiction. This concludes that the VC dimension of \(\mathcal{C}\) is at most \(d + 1\). Furthermore, by Sauer's Lemma (Sauer, 1972), for \(n \ge 2\) the shattering coefficient \(s(\mathcal{C}, n) = |\{\mathrm{Range}(G, x, r) \mid x \in \mathcal{X},\ r \ge 0\}|\) is at most \(n^{d+1}\), which completes the proof of this proposition.
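Radon's theorem invoked above is constructive: for \(d + 2\) points in \(\mathbb{R}^d\), a null vector of the associated affine system yields the partition \(V, W\) and a common point \(b\). A small numpy sketch (assuming exactly \(d + 2\) input points in general position, so the null space is one-dimensional; names ours):

```python
import numpy as np

def radon_point(points):
    """Partition d + 2 points in R^d into V, W whose convex hulls share a
    point b, via a null vector of the affine dependence system."""
    P = np.asarray(points, dtype=float)      # shape (d + 2, d)
    n = P.shape[0]
    # Find c != 0 with sum_i c_i p_i = 0 and sum_i c_i = 0.
    M = np.vstack([P.T, np.ones(n)])         # (d + 1) x (d + 2) system
    c = np.linalg.svd(M)[2][-1]              # right-singular vector spanning the null space
    pos = c > 0
    t = c[pos].sum()
    b = (c[pos] / t) @ P[pos]                # convex combination of one side
    return P[pos], P[~pos], b

# The four corners of the unit square in R^2 (d = 2, so d + 2 = 4 points).
V, W, b = radon_point([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

For the square, the two diagonals form the Radon partition and \(b\) is their common midpoint.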
C.9 Proof of Proposition 13

Without loss of generality, assume the low dimension \(d\) is even (if \(d\) is odd, we can always add an extra arbitrary row to the input matrix \(A\) and upper bound the size of the original total sensitivity set by the same analysis). Let \(a_i \in \{0,1\}^d\) be a vector with exactly \(d/2\) elements equal to 1. For each \(i \in [n]\), let \(B_i = \{j \mid a_{ij} = 1\}\), where \(a_{ij}\) denotes the \(j\)-th element of the vector \(a_i\). For fixed \(i\), define \(x\) as follows:
\[
x_j = \begin{cases} 2/d, & \text{if } j \in B_i, \\ -d, & \text{otherwise.} \end{cases} \tag{69}
\]
One immediately notices from the above expression that \(x^\top a_i = 1\). For \(j \ne i\) we have \(a_j \ne a_i\), so there exists an index \(k \in [d]\) such that \(a_{jk} = 1\) but \(a_{ik} = 0\). This implies
\[
x^\top a_j = \sum_{l=1}^d x_l\, a_{jl} = \sum_{l \in B_j,\, l \ne k} x_l + x_k \le (d/2 - 1)(2/d) - d < 0, \tag{70}
\]
which further implies \(f_j(x) = 0\). Therefore, the \(i\)-th sensitivity becomes
\[
s_i = \sup_x \frac{f_i(x)}{\sum_{j=1}^n f_j(x)} \ge 1. \tag{71}
\]
Since the above condition holds for an arbitrary index \(i \in [n]\), and we have \(\binom{d}{d/2}\) such vectors \(a_i\), i.e., \(n = \binom{d}{d/2}\), this concludes that the size of the total sensitivity set is at least \(\binom{d}{d/2} \approx 2^d\).
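For small even \(d\), the construction in (69) can be verified exhaustively; the sketch below checks \(x^\top a_i = 1\) and \(x^\top a_j < 0\) for all \(j \ne i\) when \(d = 4\) (names ours):

```python
from itertools import combinations
import numpy as np

d = 4
# All 0/1 vectors with exactly d/2 ones.
A = []
for B in combinations(range(d), d // 2):
    a = np.zeros(d)
    a[list(B)] = 1.0
    A.append(a)

def certify(i):
    """Build the x from (69) for row i; return (x . a_i, max over j != i of x . a_j)."""
    x = np.full(d, -float(d))       # x_j = -d off the support B_i
    x[A[i] == 1] = 2.0 / d          # x_j = 2/d on the support B_i
    dots = [x @ A[j] for j in range(len(A)) if j != i]
    return x @ A[i], max(dots)

ok = all(np.isclose(certify(i)[0], 1.0) and certify(i)[1] < 0 for i in range(len(A)))
```

Here \(n = \binom{4}{2} = 6\), and each index \(i\) is certified to have sensitivity at least 1, as in (71).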
Appendix D. Stochastic Gradient Descent

Consider minimizing the following objective:
\[
\underset{x\in\mathcal{X}}{\text{minimize}} \quad f(x) = \mathbb{E}_{i\sim P}[f_i(x)]. \tag{72}
\]
Stochastic gradient descent (SGD) exploits the following update rule:
\[
x_{t+1} = \arg\min_{x\in\mathcal{X}}\ \eta\,\langle x - x_t,\, g_{\xi_t}(x_t)\rangle + \tfrac{1}{2}\|x - x_t\|_2^2, \tag{73}
\]
where \(\xi_t \in [n]\) is an index drawn according to \(P\), \(g_{\xi_t}(x) = \nabla f_{\xi_t}(x)\) and \(\mathbb{E}_{\xi_t\sim P}[f_{\xi_t}(x)] = f(x)\). When \(\mathcal{X} = \mathbb{R}^d\), the update rule (73) boils down to \(x_{t+1} = x_t - \eta\, g_{\xi_t}(x_t)\). Note that if \(f_{\xi_t}(x)\) is not differentiable, we take \(g_{\xi_t}(x_t)\) to be one of its sub-gradients, i.e., \(g_{\xi_t}(x_t) \in \partial f_{\xi_t}(x_t)\); in this case, SGD boils down to the stochastic sub-gradient method. For simplicity, we still refer to the algorithm as SGD.

In the following, we present two results regarding the convergence rate of SGD on problems with non-strongly convex and strongly convex objectives, respectively.
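For concreteness, the update rule (73) for a least-squares objective with an optional Euclidean-ball constraint can be sketched as follows; this is a toy illustration of plain SGD with uniform sampling, not the PWSGD algorithm of the paper, and all names are ours:

```python
import numpy as np

def sgd(A, b, eta, T, radius=None, seed=0):
    """SGD for f(x) = E_i[(A_i x - b_i)^2] with uniform sampling; the
    optional projection onto {||x||_2 <= radius} realizes the constrained rule (73)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                   # xi_t ~ P (uniform)
        g = 2.0 * (A[i] @ x - b[i]) * A[i]    # gradient of f_i at x_t
        x = x - eta * g                       # unconstrained step
        if radius is not None and np.linalg.norm(x) > radius:
            x *= radius / np.linalg.norm(x)   # Euclidean projection Pi_X
    return x

# Toy consistent overdetermined system with exact solution x* = (1, 2).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
b = A @ np.array([1.0, 2.0])
x_hat = sgd(A, b, eta=0.05, T=4000)
```

Because the system is consistent, the gradient noise vanishes at \(x^*\) and the iterates converge to the exact solution even with a constant step-size.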
D.1 Non-strongly convex case

Here we analyze the case where the objective function \(f(x)\) is not strongly convex. Moreover, each sub-function need not be differentiable; that is, \(g_i(x)\) can be a sub-gradient of the function \(f_i\) at \(x\).
Proposition 17  Assume that \(\tfrac12\|\cdot\|_2^2 \ge \tfrac{\lambda}{2}\|\cdot\|^2\) for some norm \(\|\cdot\|\). Also assume that \(\|g_t(x_t)\|_* \le M\) for any \(t > 0\), where \(\|\cdot\|_*\) is the dual norm of \(\|\cdot\|\). The output \(\bar x = \tfrac{1}{T+1}\sum_{t=0}^T x_t\) of SGD satisfies, for any \(y \in \mathcal{X}\),
\[
\mathbb{E}[f(\bar x)] - f(y) \le \frac{\|y - x_0\|_2^2}{2\eta(T+1)} + \frac{\eta}{2\lambda} M^2. \tag{74}
\]
In particular, when \(\eta = \frac{\|y - x_0\|_2}{M}\sqrt{\frac{\lambda}{T+1}}\), we have
\[
\mathbb{E}[f(\bar x)] - f(y) \le M\,\|y - x_0\|_2\,\sqrt{\frac{1}{(T+1)\lambda}}. \tag{75}
\]

Proof  From Lemma 1 in Duchi et al. (2010), at step \(t\) we have that
\[
\eta\,(f_t(x_t) - f_t(y)) \le \tfrac12\|y - x_t\|_2^2 - \tfrac12\|y - x_{t+1}\|_2^2 + \tfrac{\eta^2}{2\lambda}\|g_t(x_t)\|_*^2. \tag{76}
\]
Conditioned on \(x_t\), taking the conditional expectation with respect to \(\xi_t\) on both sides, we have
\[
\eta\,\mathbb{E}[f_t(x_t) - f_t(y) \mid x_t] \le \tfrac12\|y - x_t\|_2^2 - \tfrac12\mathbb{E}[\|y - x_{t+1}\|_2^2 \mid x_t] + \tfrac{\eta^2}{2\lambda}\mathbb{E}[\|g_t(x_t)\|_*^2 \mid x_t]. \tag{77}
\]
Noticing that \(\mathbb{E}_{\xi_t\sim P}[f_t(x)] = f(x)\), we have
\[
\eta\,(f(x_t) - f(y)) \le \tfrac12\|y - x_t\|_2^2 - \tfrac12\mathbb{E}[\|y - x_{t+1}\|_2^2 \mid x_t] + \tfrac{\eta^2}{2\lambda}\mathbb{E}[\|g_t(x_t)\|_*^2 \mid x_t]. \tag{78}
\]
Then by taking the expectation over \(x_t\) and using the fact that \(\|g_t(x_t)\|_* \le M\), we have
\[
\eta\,\mathbb{E}[f(x_t)] - \eta\, f(y) \le \tfrac12\mathbb{E}[\|y - x_t\|_2^2] - \tfrac12\mathbb{E}[\|y - x_{t+1}\|_2^2] + \tfrac{\eta^2}{2\lambda} M^2. \tag{79}
\]
Summing up the above with \(t = 0, \dots, T\) and noticing \(\|y - x_{T+1}\|_2^2 \ge 0\), we have
\[
\eta\sum_{t=0}^T\mathbb{E}[f(x_t)] - \eta(T+1)\, f(y) \le \tfrac12\|y - x_0\|_2^2 + \tfrac{\eta^2(T+1)}{2\lambda} M^2. \tag{80}
\]
Finally, by convexity of \(f\), we have that
\[
\mathbb{E}[f(\bar x)] - f(y) \le \frac{\|y - x_0\|_2^2}{2\eta(T+1)} + \frac{\eta}{2\lambda} M^2. \tag{81}
\]
In particular, with \(\eta = \frac{\|y - x_0\|_2}{M}\sqrt{\frac{\lambda}{T+1}}\), we have
\[
\mathbb{E}[f(\bar x)] - f(y) \le M\,\|y - x_0\|_2\,\sqrt{\frac{1}{(T+1)\lambda}}, \tag{82}
\]
which completes the proof.
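The particular step-size in (75) is simply the minimizer of the right-hand side of (74), which can be checked numerically (a side computation, not part of the proof; names ours):

```python
import math

def rhs_74(eta, R, M, lam, T):
    """Right-hand side of (74): R^2 / (2 eta (T+1)) + eta M^2 / (2 lam)."""
    return R**2 / (2 * eta * (T + 1)) + eta * M**2 / (2 * lam)

# Arbitrary illustrative constants: R = ||y - x_0||_2, M, lambda, T.
R, M, lam, T = 3.0, 5.0, 0.5, 99
eta_star = (R / M) * math.sqrt(lam / (T + 1))          # step-size from (75)
bound_75 = M * R * math.sqrt(1 / ((T + 1) * lam))      # right-hand side of (75)
val = rhs_74(eta_star, R, M, lam, T)
```

At `eta_star` the two terms of (74) are equal and their sum matches the bound in (75); perturbing the step-size in either direction increases the bound.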
D.2 Strongly convex case

Here we analyze the case where the objective function \(f(x)\) is strongly convex. We make the following two assumptions:

(A1) The function \(f(x)\) is strongly convex with modulus \(\mu\). That is, for any \(x, y \in \mathcal{X}\),
\[
f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{\mu}{2}\|y - x\|_2^2. \tag{83}
\]

(A2) For each \(i \in [n]\), the gradient of each sub-function \(\nabla f_i(x)\) is Lipschitz continuous with constant \(L_i\). That is, for any \(x, y \in \mathcal{X}\),
\[
\|\nabla f_i(y) - \nabla f_i(x)\|_2 \le L_i\,\|y - x\|_2. \tag{84}
\]

The following result also appeared in Needell et al. (2014).

Proposition 18  Under assumptions (A1) and (A2), the sequence \(\{x_t\}\) generated by SGD satisfies
\[
\mathbb{E}[\|x_T - x^*\|_2^2] \le \big(1 - 2\eta\mu(1 - \eta\,\sup_i L_i)\big)^T\,\|x_0 - x^*\|_2^2 + \frac{\eta\,\sigma^2}{\mu(1 - \eta\,\sup_i L_i)}, \tag{85}
\]
where \(\sigma^2 = \mathbb{E}_{i\sim D}[\|\nabla f_i(x^*)\|_2^2]\) and \(x^*\) is the optimal solution to (72).

Proof  The proof essentially follows the same line of argument as in Needell et al. (2014). The only difference is that here we are working on the constrained problem, where the update rule (73) is equivalent to
\[
x_{t+1} = \Pi_{\mathcal{X}}\big(x_t - \eta\, g_t(x_t)\big). \tag{86}
\]
Notice that \(\Pi_{\mathcal{X}}(\cdot)\) is the projection operator onto the feasible set \(\mathcal{X}\) and is non-expansive. This further implies
\[
\|x_{t+1} - x^*\|_2^2 = \|\Pi_{\mathcal{X}}(x_t - \eta\, g_t(x_t)) - x^*\|_2^2 \le \|x_t - \eta\, g_t(x_t) - x^*\|_2^2. \tag{87}
\]
The rest of the proof follows analogous arguments in Needell et al. (2014).
Corollary 19  Given a target accuracy \(\epsilon > 0\), let the step-size be \(\eta = \frac{\epsilon\mu}{2\sigma^2 + 2\epsilon\mu\,\sup_i L_i}\). Then after
\[
T \ge \log\!\left(\frac{2\|x_0 - x^*\|_2^2}{\epsilon}\right)\left(\frac{\sigma^2}{\mu^2\epsilon} + \frac{\sup_i L_i}{\mu}\right) \tag{88}
\]
iterations, we have that
\[
\mathbb{E}[\|x_T - x^*\|_2^2] \le \epsilon. \tag{89}
\]

Proof  The proof follows by substituting the above choice of \(\eta\) into Proposition 18 and bounding each of the two terms in (85) by \(\epsilon/2\).
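The step-size in Corollary 19 is chosen so that the additive noise term in (85) is at most \(\epsilon/2\), leaving the rest of the error budget for the contracting term; a quick numerical check (a side computation with illustrative constants; names ours):

```python
def noise_term(eta, mu, sigma2, supL):
    """Additive term in (85): eta * sigma^2 / (mu * (1 - eta * sup_i L_i))."""
    return eta * sigma2 / (mu * (1 - eta * supL))

# Arbitrary illustrative constants: mu, sigma^2, sup_i L_i, target accuracy.
mu, sigma2, supL, eps = 0.7, 2.0, 3.0, 0.01
eta = eps * mu / (2 * sigma2 + 2 * eps * mu * supL)   # step-size from Corollary 19
residual = noise_term(eta, mu, sigma2, supL)          # should be <= eps / 2
rate = 2 * eta * mu * (1 - eta * supL)                # per-iteration contraction exponent
```

With this `eta`, algebra gives `residual` \(= \epsilon\sigma^2/(2\sigma^2 + \epsilon\mu\,\sup_i L_i) \le \epsilon/2\), and `rate` stays in \((0, 1)\) so the geometric term in (85) indeed contracts.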