## Large-scale inference with block structure.
*(English)*
Zbl 07547941

Summary: The detection of weak and rare effects in large amounts of data arises in a number of modern data analysis problems. Known results show that in this situation the potential of statistical inference is severely limited by the large-scale multiple testing that is inherent in these problems. Here, we show that fundamentally more powerful statistical inference is possible when there is some structure in the signal that can be exploited, for example, if the signal is clustered in many small blocks, as is the case in some relevant applications. We derive the detection boundary in such a situation where we allow both the number of blocks and the block length to grow polynomially with sample size. We derive these results both for the univariate and the multivariate settings as well as for the problem of detecting clusters in a network. These results recover as special cases the sparse signal detection problem (Ann. Statist. 32 (2004) 962–994) where there is no structure in the signal, as well as the scan problem (Statist. Sinica 23 (2013) 409–428) where the signal comprises a single interval. We develop methodology that allows optimal adaptive detection in the general setting, thus exploiting the structure if it is present without incurring a relevant penalty in the case where there is no structure. The advantage of this methodology can be considerable, as in the case of no structure the means need to increase at the rate \(\sqrt{\log n}\) to ensure detection, while the presence of structure allows detection even if the means decrease at a polynomial rate.
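The gain from block structure can be illustrated with a minimal sketch. It is not the structured higher criticism or structured Berk-Jones statistic of the paper. It assumes unit-variance Gaussian noise, a known block length, and blocks aligned to a fixed grid, and the helper names (`normal_sf`, `higher_criticism`, `block_z_scores`) are hypothetical. The underlying fact is that a standardized sum over a block of length \(L\) with common mean \(\mu\) has mean \(\mu\sqrt{L}\), so pooling within blocks amplifies a weak per-observation signal before the classical higher criticism statistic of Donoho and Jin (2004) is applied.

```python
import math
import random


def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def higher_criticism(z_scores, alpha0=0.5):
    """Classical higher criticism on one-sided p-values.

    Maximizes sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    over the smallest alpha0 fraction of the sorted p-values.
    """
    n = len(z_scores)
    pvals = sorted(normal_sf(z) for z in z_scores)
    best = -math.inf
    for i in range(1, int(alpha0 * n) + 1):
        p = pvals[i - 1]
        if 0.0 < p < 1.0:  # skip p-values that underflowed to 0 or equal 1
            hc = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
            best = max(best, hc)
    return best


def block_z_scores(x, block_len):
    """Standardized sums over consecutive non-overlapping blocks.

    Assumption for illustration only: the block length is known and
    blocks are aligned to the grid 0, block_len, 2 * block_len, ...
    """
    return [sum(x[s:s + block_len]) / math.sqrt(block_len)
            for s in range(0, len(x) - block_len + 1, block_len)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative parameters, not taken from the paper.
    n, block_len, n_blocks, mu = 10_000, 25, 20, 0.8
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Plant the signal in a few randomly chosen grid-aligned blocks.
    for s in rng.sample(range(0, n, block_len), n_blocks):
        for j in range(s, s + block_len):
            x[j] += mu
    print("HC on raw observations:", higher_criticism(x))
    print("HC on block z-scores:  ", higher_criticism(block_z_scores(x, block_len)))
```

In this sketch a per-observation mean of 0.8 becomes a shift of \(0.8\sqrt{25} = 4\) on the block scale. An adaptive procedure, as described in the summary, would not assume a known block length or alignment.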

### MSC:

62G10 | Nonparametric hypothesis testing |

62G32 | Statistics of extreme values; tail inference |

62H15 | Hypothesis testing in multivariate analysis |

### Keywords:

block structure; heterogeneous mixture detection; sparse signal detection; structured Berk-Jones statistic; structured higher criticism; structured \(\phi\)-divergence; tail bound for higher criticism statistic and Berk-Jones statistic; tail bound for supremum of binomial log likelihood ratio process; tail bound for supremum of standardized Brownian bridge

*J. Kou* and *G. Walther*, Ann. Stat. 50, No. 3, 1541–1572 (2022; Zbl 07547941)

### References:

[1] | Arias-Castro, E., Candès, E. J. and Durand, A. (2011). Detection of an anomalous cluster in a network. Ann. Statist. 39 278-304. · Zbl 1209.62097 · doi:10.1214/10-AOS839 |

[2] | Arias-Castro, E., Donoho, D. L. and Huo, X. (2005). Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans. Inf. Theory 51 2402-2425. · Zbl 1282.94014 · doi:10.1109/TIT.2005.850056 |

[3] | Cai, T. T., Jeng, X. J. and Jin, J. (2011). Optimal detection of heterogeneous and heteroscedastic mixtures. J. R. Stat. Soc. Ser. B. Stat. Methodol. 73 629-662. · Zbl 1228.62020 · doi:10.1111/j.1467-9868.2011.00778.x |

[4] | Chan, H. P. (2009). Detection of spatial clustering with average likelihood ratio test statistics. Ann. Statist. 37 3985-4010. · Zbl 1184.62067 · doi:10.1214/09-AOS701 |

[5] | Chan, H. P. and Walther, G. (2013). Detection with the scan and the average likelihood ratio. Statist. Sinica 23 409-428. · Zbl 1257.62096 |

[6] | Delaigle, A. and Hall, P. (2009). Higher criticism in the context of unknown distribution, non-independence and classification. In Perspectives in Mathematical Sciences. I. Stat. Sci. Interdiscip. Res. 7 109-138. World Sci. Publ., Hackensack, NJ. · doi:10.1142/9789814273633_0006 |

[7] | Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962-994. · Zbl 1092.62051 · doi:10.1214/009053604000000265 |

[8] | Duembgen, L. and Wellner, J. A. (2014). Confidence bands for distribution functions: A new look at the law of the iterated logarithm. Available at arXiv:1402.2918. |

[9] | Fan, Y., Jin, J. and Yao, Z. (2013). Optimal classification in sparse Gaussian graphic model. Ann. Statist. 41 2537-2571. · Zbl 1294.62061 · doi:10.1214/13-AOS1163 |

[10] | Gangnon, R. E. and Clayton, M. K. (2001). A weighted average likelihood ratio test for spatial clustering of disease. Stat. Med. 20 2977-2987. |

[11] | Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38 1686-1732. · Zbl 1189.62080 · doi:10.1214/09-AOS764 |

[12] | Ingster, Yu. I. (1997). Some problems of hypothesis testing leading to infinitely divisible distributions. Math. Methods Statist. 6 47-69. · Zbl 0878.62005 |

[13] | Ingster, Yu. I. (1998). Minimax detection of a signal for \(l^n\)-balls. Math. Methods Statist. 7 401-428. · Zbl 1103.62312 |

[14] | Ingster, Y. I., Pouet, C. and Tsybakov, A. B. (2009). Classification of sparse high-dimensional vectors. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4427-4448. · Zbl 1185.62115 · doi:10.1098/rsta.2009.0156 |

[15] | Ingster, Yu. I. and Suslina, I. A. (2002). On the detection of a signal with a known shape in a multichannel system. Zap. Nauchn. Sem. S.-Peterburg. Otdel. Mat. Inst. Steklov. (POMI) 294 88-112, 261. · Zbl 1259.94029 · doi:10.1007/s10958-005-0133-z |

[16] | Itô, K. and McKean, H. P. Jr. (1965). Diffusion Processes and Their Sample Paths. Die Grundlehren der Mathematischen Wissenschaften, Band 125. Academic Press, New York; Springer, Berlin-New York. · Zbl 0127.09503 |

[17] | Jager, L. and Wellner, J. A. (2007). Goodness-of-fit tests via phi-divergences. Ann. Statist. 35 2018-2053. · Zbl 1126.62030 · doi:10.1214/0009053607000000244 |

[18] | Jeng, X. J., Cai, T. T. and Li, H. (2010). Optimal sparse segment identification with application in copy number variation analysis. J. Amer. Statist. Assoc. 105 1156-1166. · Zbl 1390.62170 · doi:10.1198/jasa.2010.tm10083 |

[19] | Kou, J. (2017). Large-Scale Inference with Block Structure. ProQuest LLC, Ann Arbor, MI. Thesis (Ph.D.)-Stanford University. |

[20] | Kou, J. (2021). Identifying the support of rectangular signals in Gaussian noise. Comm. Statist. Theory Methods 1-28. |

[21] | Kulldorff, M. (1999). Spatial scan statistics: Models, calculations, and applications. In Scan Statistics and Applications. Stat. Ind. Technol. 303-322. Birkhäuser, Boston, MA. · Zbl 0941.62063 · doi:10.1007/978-1-4612-1578-3_14 |

[22] | Li, J. and Siegmund, D. (2015). Higher criticism: \(p\)-values and criticism. Ann. Statist. 43 1323-1350. · Zbl 1320.62039 · doi:10.1214/15-AOS1312 |

[23] | Miller, R. and Siegmund, D. (1982). Maximally selected chi square statistics. Biometrics 38 1011-1016. · Zbl 0502.62091 · doi:10.2307/2529881 |

[24] | Rivera, C. and Walther, G. (2013). Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics. Scand. J. Stat. 40 752-769. · Zbl 1283.62179 · doi:10.1111/sjos.12027 |

[25] | Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. Wiley, New York. · Zbl 1170.62365 |

[26] | Verzelen, N. and Arias-Castro, E. (2017). Detection and feature selection in sparse mixture models. Ann. Statist. 45 1920-1950. · Zbl 1486.62192 · doi:10.1214/16-AOS1513 |

[27] | Walther, G. (2010). Optimal and fast detection of spatial clusters with scan statistics. Ann. Statist. 38 1010-1033. · Zbl 1183.62076 · doi:10.1214/09-AOS732 |

[28] | Walther, G. (2013). The average likelihood ratio for large-scale multiple testing and detecting sparse mixtures. In From Probability to Statistics and Back: High-Dimensional Models and Processes. Inst. Math. Stat. (IMS) Collect. 9 317-326. IMS, Beachwood, OH. · Zbl 1356.62095 · doi:10.1214/12-IMSCOLL923 |

[29] | Zhong, P.-S., Chen, S. X. and Xu, M. (2013). Tests alternative to higher criticism for high-dimensional means under sparsity and column-wise dependence. Ann. Statist. 41 2820-2851. · Zbl 1294.62128 · doi:10.1214/13-AOS1168 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.