Li, Keren; Yang, Jie
Score-matching representative approach for big data analysis with generalized linear models. (English)
Zbl 1493.62449
Electron. J. Stat. 16, No. 1, 592-635 (2022).

Summary: We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. Given a partition of a massive dataset, this approach constructs one representative data point for each data block and fits the target model to the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. As for efficiency, given a homogeneous partition its accuracy in estimating parameters is comparable with that of the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, we conclude that mean representatives (MR) work well for linear models and for generalized linear models with a flat inverse link function and moderate coefficients of continuous predictors. For general cases, we recommend the proposed score-matching representatives (SMR), which may improve the accuracy of estimators significantly by matching the score function values. As an illustrative application to the Airline on-time performance data, we show that the MR and SMR estimates are as good as the full-data estimate when the latter is available.

Cited in 1 Document

MSC:
62J12 Generalized linear models (logistic models)
62R07 Statistical aspects of big data and data science

Keywords: big data regression; user data localization; distributed database; mean representative approach; divide and conquer; subsampling

Software: CoCoA; dobson
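The mean-representative (MR) idea summarized above can be illustrated with a minimal sketch for a linear model: partition the data, replace each block by its mean point weighted by block size, and fit weighted least squares on the representatives. The partition rule (sorting on the predictor to obtain a homogeneous partition), the variable names, and all numerical settings below are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Illustrative sketch of the MR approach for a linear model (assumed setup,
# not the paper's implementation).
rng = np.random.default_rng(0)
N, block_size = 5000, 50
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=N)

# Homogeneous partition (assumption): sort on the predictor, cut into equal blocks.
order = np.argsort(x)
x_s, y_s = x[order], y[order]
n_blocks = N // block_size

# One representative per block: the block mean of (x, y), weighted by block size.
x_bar = x_s.reshape(n_blocks, block_size).mean(axis=1)
y_bar = y_s.reshape(n_blocks, block_size).mean(axis=1)
w = np.full(n_blocks, float(block_size))

# Weighted least squares on the representative dataset.
X = np.column_stack([np.ones(n_blocks), x_bar])
W = np.diag(w)
beta_mr = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_bar)

# Full-data OLS for comparison (available here because the toy data are small).
X_full = np.column_stack([np.ones(N), x])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
print(beta_mr, beta_full)
```

With a homogeneous partition like this one, the representative fit recovers the intercept and slope closely, consistent with the summary's claim that MR suffices for linear models; the SMR construction, which matches score function values instead of means, is what the paper recommends for general GLMs.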