## Optimal and fast detection of spatial clusters with scan statistics.(English)Zbl 1183.62076

Summary: We consider the detection of multivariate spatial clusters in the Bernoulli model with $$N$$ locations, where the design distribution has weakly dependent marginals. The locations are scanned with a rectangular window with sides parallel to the axes and with varying sizes and aspect ratios. Multivariate scan statistics pose a statistical problem due to the multiple testing over many scan windows, as well as a computational problem because statistics have to be evaluated on many windows.
This paper introduces a methodology that leads to both statistically optimal inference and computationally efficient algorithms. The main difference to the traditional calibration of scan statistics is the concept of grouping scan windows according to their sizes, and then applying different critical values to different groups. It is shown that this calibration of the scan statistic results in optimal inference for spatial clusters on both small scales and on large scales, as well as in the case where the cluster lives on one of the marginals. A methodology is introduced that allows for an efficient approximation of the set of all rectangles while still guaranteeing the statistical optimality results described above. It is shown that the resulting scan statistic has a computational complexity that is almost linear in $$N$$.

### MSC:

 62G10 Nonparametric hypothesis testing 62H30 Classification and discrimination; cluster analysis (statistical aspects) 65C60 Computational problems in statistics (MSC2010) 62H15 Hypothesis testing in multivariate analysis 62H11 Directional data; spatial statistics
Full Text:

### References:

 [1] Alm, S. E. (1997). On the distribution of the scan statistic of a two-dimensional Poisson process. Adv. in Appl. Probab. 29 1-16. JSTOR: · Zbl 0883.60046 [2] Anderson, N. H. and Titterington, D. M. (1997). Some methods for investigating spatial clustering, with epidemiological applications. J. Roy. Statist. Soc. Ser. A 160 87-105. [3] Chen, J. and Glaz, J. (1996). Two-dimensional discrete scan statistics. Statist. Probab. Lett. 31 59-68. · Zbl 0880.62014 [4] Derbeko, P., El-Yaniv, R. and Meir, R. (2004). Explicit learning curves for transduction and applications to clustering and compression algorithms. J. Artificial Intelligence Res. 22 117-142. · Zbl 1148.68430 [5] Dümbgen, L. and Spokoiny, V. G. (2001). Multiscale testing of qualitative hypotheses. Ann. Statist. 29 124-152. · Zbl 1029.62070 [6] Dümbgen, L. and Walther, G. (2008). Multiscale inference about a density. Ann. Statist. 36 1758-1785. · Zbl 1142.62336 [7] Feller, W. (1968). An Introduction to Probability Theory and Its Applications , 3rd ed. I . Wiley, New York. · Zbl 0155.23101 [8] Glaz, J. and Balakrishnan, N. (eds.) (1999). Scan Statistics and Applications . Birkhäuser, Boston. · Zbl 0919.00015 [9] Glaz, J., Naus, J. and Wallenstein, S. (2001). Scan Statistics . Springer, New York. · Zbl 0983.62075 [10] Hoeffding, W. (1963). Probability inequalities for sums of bounded random. J. Amer. Statist. Assoc. 58 13-30. JSTOR: · Zbl 0127.10602 [11] Kulldorff, M. (1997). A spatial scan statistic. Comm. Statist. Theory Methods 26 1481-1496. · Zbl 0920.62116 [12] Kulldorff, M. (1999). Spatial scan statistics: Models, calculations, and applications. In: Scan Statistics and Applications (J. Glaz and N. Balakrishnan, eds.). Birkhäuser, Boston. · Zbl 0941.62063 [13] Lepski, O. V. and Tsybakov, A. B. (2000). Asymptotically exact nonparametric hypothesis testing in sup-norm and at a fixed point. Probab. Theory Related Fields 117 17-48. · Zbl 0971.62022 [14] Loader, C. R. (1991). Large-deviation approximations to the distribution of scan statistics. Adv. in Appl. Probab. 23 751-771. JSTOR: · Zbl 0741.60036 [15] Naiman, D. Q. and Priebe, C. E. (2001). Computing scan statistic p -values using importance sampling, with applications to genetics and medical image analysis. J. Comput. Graph. Statist. 26 296-328. JSTOR: · Zbl 04567025 [16] Naus, J. (1965). Clustering of random points in two dimensions. Biometrika 52 263-267. JSTOR: · Zbl 0132.39702 [17] Neill, D. and Moore, A. (2004a). A fast multi-resolution method for detection of significant spatial disease clusters. Adv. Neural Inf. Process. Syst. 10 651-658. [18] Neill, D. and Moore, A. (2004b). Rapid detection of significant spatial disease clusters. In Proc. Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 256-265. ACM, New York. [19] Roederer, M. and Hardy, R. R. (2001). Frequency difference gating: A multivariate method for identifying subsets that differ between samples. Cytometry 45 56-64. [20] Roederer, M., Moore, W., Treister, A., Hardy, R. R. and Herzenberg, L. (2001). Probability binning comparison: A metric for quantitating multivariate distribution differences. Cytometry 45 47-55. [21] Rohde, A. (2009). Sharp-optimal adjustment for multiple testing in the multivariate two-sample problem. Manuscript. Preprint No. 1356, Weirstrass Institute, Berlin, Germany. [22] Rufibach, K. and Walther, G. (2009). The block criterion for multiscale inference about a density, with applications to other multiscale problems. J. Comput. Graph. Statist.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.