The pigeonhole bootstrap. (English) Zbl 1126.62027

Summary: Recently there has been much interest in data that, in statistical language, may be described as having a large, crossed, and severely unbalanced random effects structure. Such data sets arise in recommender engines and information retrieval problems. Many large bipartite weighted graphs have this structure too. We would like to assess the stability of algorithms fit to such data. Even for linear statistics, a naive form of bootstrap sampling can be seriously misleading, and P. McCullagh [Bernoulli 6, 285–301 (2000; Zbl 0976.62035)] has shown that no bootstrap method is exact. We show that an alternative bootstrap, which resamples the rows and columns of the data matrix separately, satisfies a mean consistency property even in heteroscedastic crossed unbalanced random effects models. This alternative does not require the user to fit a crossed random effects model to the data.
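
The row-and-column resampling idea in the summary can be illustrated with a minimal sketch. It is one possible reading, not the paper's implementation. It assumes the data arrive as sparse (row, column, value) triples, that the statistic of interest is the grand mean, and that "separately resampling" means drawing row ids and column ids independently with replacement and keeping each observed entry as many times as its row count times its column count. The function name `pigeonhole_bootstrap_means` and all parameters are hypothetical.

```python
import numpy as np


def pigeonhole_bootstrap_means(rows, cols, values, n_boot=1000, seed=0):
    """Bootstrap replicates of the grand mean under independent
    resampling of row ids and column ids (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    values = np.asarray(values, dtype=float)

    # Map arbitrary row/column labels to 0..R-1 and 0..C-1.
    row_ids, row_idx = np.unique(rows, return_inverse=True)
    col_ids, col_idx = np.unique(cols, return_inverse=True)
    n_rows, n_cols = len(row_ids), len(col_ids)

    means = np.empty(n_boot)
    for b in range(n_boot):
        # Multinomial counts: how often each row / column is drawn
        # when sampling with replacement.
        w_row = rng.multinomial(n_rows, np.full(n_rows, 1.0 / n_rows))
        w_col = rng.multinomial(n_cols, np.full(n_cols, 1.0 / n_cols))

        # Assumption: an observed entry (i, j) appears in the resampled
        # data once for every (row draw, column draw) pair that hits it.
        w = w_row[row_idx] * w_col[col_idx]

        total = w.sum()
        # With sparse data a replicate can contain no observed entries.
        means[b] = np.nan if total == 0 else (w * values).sum() / total
    return means
```

Under these assumptions, the spread of the returned replicates (for example `np.nanstd(means)`) gives a rough stability measure for the grand mean without fitting a crossed random effects model.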


62G09 Nonparametric statistical resampling methods
62P99 Applications of statistics




[1] Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and analysis. PNAS 97 10101-10106.
[2] Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York. · Zbl 0353.62011
[3] Cornfield, J. and Tukey, J. W. (1956). Average values of mean squares in factorials. Ann. Math. Statist. 27 907-949. · Zbl 0075.29404
[4] Crossa, J. and Cornelius, P. L. (2002). Linear-bilinear models for the analysis of genotype-environment interaction data. In Quantitative Genetics, Genomics and Plant Breeding in the 21st Century, an International Symposium (M. S. Kang, ed.) 305-322. CAB International, Wallingford, UK.
[5] Deerwester, S., Dumais, S., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41 391-407.
[6] Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[7] Fisher, R. A. and Mackenzie, W. A. (1923). The manurial response of different potato varieties. J. Agricultural Science XIII 311-320.
[8] McCullagh, P. (2000). Resampling and exchangeable arrays. Bernoulli 6 285-301. · Zbl 0976.62035
[9] Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components. Wiley, New York. · Zbl 0850.62007
[10] Tukey, J. W. (1949). One degree of freedom for non-additivity. Biometrics 5 232-242.