
On \(b\)-bit min-wise hashing for large-scale regression and classification with sparse data. (English) Zbl 1468.68170

Summary: Large-scale regression problems, where both the number of variables \(p\) and the number of observations \(n\) may be large and of the order of millions or more, are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction, obtaining a new design matrix with far fewer columns, and then to work with this compressed data.
\(b\)-bit min-wise hashing [P. Li and A. C. König, “Theory and applications of \(b\)-bit minwise hashing”, Commun. ACM 54, No. 8, 101–109 (2011; doi:10.1145/1978542.1978566); P. Li et al., “Hashing algorithms for large-scale learning”, in: Advances in neural information processing systems 24 (NIPS 2011), 9 p. (2011), https://proceedings.neurips.cc/paper/2011/hash/0a1bf96b7165e962e90cb14648c9462d-Abstract.html] is a promising dimension reduction scheme for sparse matrices: it produces a set of random features such that regression on the resulting design matrix approximates a kernel regression with the resemblance kernel. In this work, we derive bounds on the prediction error of such regressions. For both linear and logistic models, we show that the average prediction error vanishes asymptotically as long as \(q \|\boldsymbol{\beta}^*\|_2^2 /n \rightarrow 0\), where \(q\) is the average number of non-zero entries in each row of the design matrix and \(\boldsymbol{\beta}^*\) is the coefficient vector of the linear predictor.
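The feature construction described above can be illustrated in a few lines. The following is a minimal sketch, assuming a dense 0/1 numpy array as the design matrix and using random linear hash functions in place of true random permutations; the function name bbit_minhash_features and the parameters k (number of hashes) and b (bits retained) are illustrative and not taken from the paper.

```python
import numpy as np

def bbit_minhash_features(X, k=100, b=2, seed=0):
    """Compress an n x p binary design matrix X into an n x (k * 2**b) matrix
    of one-hot encoded b-bit min-wise hashes (a sketch of the scheme of
    Li and Koenig, 2011)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    prime = 2_147_483_647                 # large prime; hashes stand in for random permutations
    a = rng.integers(1, prime, size=k)    # k independent hash functions h_j(i) = (a_j * i + c_j) mod prime
    c = rng.integers(0, prime, size=k)
    Z = np.zeros((n, k * 2 ** b))
    for row in range(n):
        idx = np.flatnonzero(X[row])      # the set of non-zero coordinates in this row
        if idx.size == 0:
            continue
        hashed = (np.outer(a, idx) + c[:, None]) % prime
        minhash = hashed.min(axis=1)      # min-wise hash value under each hash function
        lowbits = minhash % (2 ** b)      # retain only the lowest b bits
        Z[row, np.arange(k) * 2 ** b + lowbits] = 1.0   # one-hot expansion of each b-bit value
    return Z
```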
We also show that ordinary least squares or ridge regression applied to the reduced data can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied in order for the signal to be linear in the predictors.
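To illustrate this second point, the sketch below applies closed-form ridge regression to the compressed design matrices produced by the hashing sketch above; lam is an illustrative tuning parameter, and for logistic models the closed-form solve would be replaced by a standard logistic-regression solver.

```python
import numpy as np

def ridge_on_hashed_features(X_train, y_train, X_test, k=200, b=1, lam=1.0, seed=0):
    """Compress both design matrices with the SAME hash functions (same seed),
    then fit ridge regression on the compressed training data and predict."""
    Z_train = bbit_minhash_features(X_train, k=k, b=b, seed=seed)
    Z_test = bbit_minhash_features(X_test, k=k, b=b, seed=seed)
    d = Z_train.shape[1]
    # closed-form ridge estimate: (Z'Z + lam * I)^{-1} Z'y
    beta_hat = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(d),
                               Z_train.T @ y_train)
    return Z_test @ beta_hat
```

Reusing the same seed for the training and test matrices is essential, so that both are compressed with identical hash functions.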

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J07 Ridge regression; shrinkage estimators (Lasso)

Software:

Vowpal Wabbit

References:

[1] D. Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM symposium on principles of database systems, pages 274–281. ACM, 2001.
[2] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209, 2013.
[3] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18:1–38, 2017. · Zbl 1435.65045
[4] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005. · Zbl 1190.62116
[5] M. Bouchard, A.-L. Jousselme, and P.-E. Doré. A proof for the positive definiteness of the Jaccard index matrix. International Journal of Approximate Reasoning, 54(5):615–626, 2013. · Zbl 1316.68170
[6] C. Boutsidis and P. Drineas. Random projections for the nonnegative least-squares problem. Linear Algebra and its Applications, 431:760–771, 2009. · Zbl 1167.65032
[7] L. Breiman. Random Forests. Machine Learning, 45:5–32, 2001. · Zbl 1007.68152
[8] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proceedings of the thirtieth annual ACM symposium on theory of computing, pages 327–336. ACM, 1998. · Zbl 1007.68997
[9] P. Bühlmann. Boosting for high-dimensional linear models. Annals of Statistics, 34:559–583, 2006. · Zbl 1095.62077
[10] P. Bühlmann and S. van de Geer. Statistics for high-dimensional data. Springer, 2011.
[11] J.L. Carter and M.N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143–154, 1979. · Zbl 0412.68090
[12] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13:64–78, 2001.
[13] M. Datar and S. Muthukrishnan. Estimating rarity and similarity over data stream windows. Lecture Notes in Computer Science, 2461:323, 2002. · Zbl 1019.68533
[14] P. Drineas, M.W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117:219–249, 2011. · Zbl 1218.65037
[15] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–451, 2004. · Zbl 1091.62054
[16] D. Fradkin and D. Madigan. Experiments with random projections for machine learning. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pages 517–522. ACM, 2003.
[17] J. Friedman and B. Popescu. Predictive learning via rule ensembles. Annals of Applied Statistics, 2:916–954, 2008. · Zbl 1149.62051
[18] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, pages 55–67, 1970. · Zbl 0202.17205
[19] T. Hofmann, B. Schölkopf, and A. Smola. Kernel methods in machine learning. Annals of Statistics, pages 1171–1220, 2008. · Zbl 1151.30007
[20] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984. · Zbl 0539.46017
[21] I.T. Jolliffe. Principal components in regression analysis. In Principal component analysis, pages 129–155. Springer, 1986.
[22] A. Kabán. New bounds on compressive linear least squares regression. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 448–456, 2014.
[23] P. Kar and H. Karnick. Random feature maps for dot product kernels. In Neil D. Lawrence and Mark A. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22, pages 583–591, 2012.
[24] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit online learning project, 2007.
[25] Q. Le, T. Sarlós, and A. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252, 2013.
[26] P. Li. Core kernels. arXiv preprint arXiv:1404.6216, 2014.
[27] P. Li and A.C. König. Theory and applications of b-bit minwise hashing. Communications of the ACM, 54:101–109, 2011.
[28] P. Li, T. Hastie, and K. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 287–296. ACM, 2006.
[29] P. Li, A. Shrivastava, J. Moore, and A. König. Hashing algorithms for large-scale learning. In Advances in Neural Information Processing Systems, pages 2672–2680, 2011.
[30] P. Li, A. Owen, and C.-H. Zhang. One permutation hashing. In Advances in Neural Information Processing Systems, pages 3122–3130, 2012.
[31] P. Li, A. Shrivastava, and A. König. b-bit minwise hashing in practice. In Proceedings of the 5th Asia-Pacific Symposium on Internetware, page 13. ACM, 2013.
[32] M.W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011. · Zbl 1232.68173
[33] O. Maillard and R. Munos. Linear regression with random projections. Journal of Machine Learning Research, 13:2735–2772, 2012. · Zbl 1434.62055
[34] J. Pennington, F. Yu, and S. Kumar. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems, pages 1846–1854, 2015.
[35] M. Pilanci and M.J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. arXiv preprint arXiv:1411.0347, 2014. · Zbl 1360.62400
[36] M. Pilanci and M.J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015. · Zbl 1359.90097
[37] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.
[38] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pages 555–561. IEEE, 2008.
[39] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, pages 1313–1320, 2009.
[40] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
[41] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10:2615–2637, 2009. · Zbl 1235.68188
[42] M. Steele. The Cauchy–Schwarz Master Class. Cambridge University Press, 2004. · Zbl 1060.26023
[43] D. J. Sutherland and J. Schneider. On the error of random Fourier features. In UAI, 2015.
[44] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996. · Zbl 0850.62538
[45] J.A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50:2231–2242, 2004. · Zbl 1288.94019
[46] S.A. Van De Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36:614–645, 2008. · Zbl 1138.62323
[47] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.
[48] S. Vempala. The random projection method, volume 65. American Mathematical Society, 2005. · Zbl 1048.68131
[49] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
[50] C.K.I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
[51] Y. Yang, M. Pilanci, and M.J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. Annals of Statistics, 45:991–1023, 2017. · Zbl 1371.62039
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.