Bootstrapping data arrays of arbitrary order. (English) Zbl 1454.62131

Summary: In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the levels of each factor, giving each observation the product of independently sampled factor weights. No exact bootstrap exists for this problem [P. McCullagh, Bernoulli 6, No. 2, 285–301 (2000; Zbl 0976.62035)]. We show that the proposed bootstrap is mildly conservative, meaning biased toward overestimating the variance, under sufficient conditions that allow very unbalanced and heteroscedastic inputs. Earlier results for a resampling bootstrap only apply to two factors and use multinomial weights that are poorly suited to online computation. The proposed reweighting approach can be implemented in parallel and online settings. The results for this method apply to any number of factors. The method is illustrated using a 3 factor data set of comment lengths from Facebook.
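The reweighting scheme described in the summary can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `product_weight_bootstrap`, its parameters, and the choice of factor weights drawn uniformly from {0, 2} (mean 1, variance 1) are assumptions made for this example, since the summary does not specify a weight distribution. Each level of each factor receives its own independent weight, and an observation's weight is the product of the weights of its factor levels.

```python
import random
import statistics


def product_weight_bootstrap(values, factor_ids, n_boot=200, seed=0):
    """Bootstrap estimate of the variance of the mean of `values`.

    values     : list of floats, one per observation
    factor_ids : list of tuples; factor_ids[i][j] is the level of
                 factor j for observation i (any hashable label)
    The {0, 2} weight distribution is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n_factors = len(factor_ids[0])
    replicate_means = []
    for _ in range(n_boot):
        # One independent weight table per factor, filled lazily per level.
        level_weights = [dict() for _ in range(n_factors)]
        weighted_sum = 0.0
        weight_total = 0.0
        for y, ids in zip(values, factor_ids):
            w = 1.0
            for j, level in enumerate(ids):
                table = level_weights[j]
                if level not in table:
                    table[level] = rng.choice((0.0, 2.0))
                w *= table[level]  # product of independent factor weights
            weighted_sum += w * y
            weight_total += w
        # Skip the rare replicate in which every observation has weight zero.
        if weight_total > 0:
            replicate_means.append(weighted_sum / weight_total)
    # Sample variance of the reweighted means across replicates.
    return statistics.variance(replicate_means)
```

Because each replicate needs only one pass over the data and a running pair of sums, a version of this sketch could plausibly be adapted to streaming or parallel use, for example by deriving each level's weight from a hash of (replicate, factor, level) instead of storing a table; that adaptation is a suggestion here, not something the summary spells out.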

MSC:

62G09 Nonparametric statistical resampling methods
62D05 Sampling theory, sample surveys

Citations:

Zbl 0976.62035

Software:

Hive

References:

[1] Bennett, J. and Lanning, S. (2007). The Netflix prize. In Proceedings of KDD Cup and Workshop 2007 35. ACM, New York.
[2] Brennan, R. L., Harris, D. J. and Hanson, B. A. (1987). The bootstrap and other procedures for examining the variability of estimated variance components. Technical report, ACT.
[3] Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist. 7 1-26. · Zbl 0406.62024 · doi:10.1214/aos/1176344552
[4] Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer, New York. · Zbl 0744.62026
[5] Lee, H. K. H. and Clyde, M. A. (2004). Lossless online Bayesian bagging. J. Mach. Learn. Res. 5 143-151.
[6] Mammen, E. (1992). When Does Bootstrap Work. Lecture Notes in Statistics 77. Springer, New York. · Zbl 0760.62038
[7] Mammen, E. (1993). Bootstrap and wild bootstrap for high-dimensional linear models. Ann. Statist. 21 255-285. · Zbl 0771.62032 · doi:10.1214/aos/1176349025
[8] McCarthy, P. J. (1969). Pseudo-replication: Half samples. Review of the International Statistical Institute 37 239-264. · Zbl 0186.53001
[9] McCullagh, P. (2000). Resampling and exchangeable arrays. Bernoulli 6 285-301. · Zbl 0976.62035 · doi:10.2307/3318577
[10] Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. J. Roy. Statist. Soc. Ser. B 56 3-48. · Zbl 0788.62026
[11] Owen, A. B. (2007). The pigeonhole bootstrap. Ann. Appl. Stat. 1 386-411. · Zbl 1126.62027 · doi:10.1214/07-AOAS122
[12] Oza, N. and Russell, S. (2001). Online bagging and boosting. In Artificial Intelligence and Statistics 2001 105-112. Morgan Kaufmann, San Mateo, CA.
[13] Rubin, D. B. (1981). The Bayesian bootstrap. Ann. Statist. 9 130-134. · doi:10.1214/aos/1176345338
[14] Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components. Wiley, New York. · Zbl 0850.62007
[15] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P. and Murthy, R. (2009). Hive: A warehousing solution over a map-reduce framework. In Proceedings of the VLDB Endowment, Vol. 2 1626-1629. VLDB Endowment.
[16] Wiley, E. W. (2001). Bootstrap strategies for variance component estimation: Theoretical and empirical results. Ph.D. thesis, Stanford Univ.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.