Scan statistic tail probability assessment based on process covariance and window size. (English) Zbl 1349.62168

Summary: A scan statistic is examined for testing the existence of a global peak in a random process of dependent variables with an arbitrary distribution. The tail probability of the scan statistic is obtained from the covariance of the moving-sums process, thereby accounting for the spatial nature of the data as well as the size of the search window. Exact formulas linking this covariance to the window size and the correlation coefficient are developed under general, common, and auto-covariance structures of the variables in the original process. The implementation and applicability of the formulas are demonstrated on multiple processes of t-statistics, including the case of unknown covariance. A sensitivity analysis provides further insight into how the tail probability varies with the parameters that influence it. R code for the tail probability computation and the data analysis is provided in the supplementary material.
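
Illustration (not taken from the paper): the following is a minimal sketch of how the covariance of a moving-sums process depends on the window size and the correlation coefficient. It relies only on bilinearity of covariance, Cov(S_i, S_j) = sum of Cov(X_k, X_l) over k in window i and l in window j. The function names, the AR(1)-type form sigma2 * rho^|k-l| used for the auto-covariance case, and the closed form for the common-correlation case are assumptions made for this example; they may differ from the exact formulas developed in the paper, and the sketch is in Python rather than the authors' R code.

```python
def common_cov(n, sigma2, rho):
    """n x n covariance: variance sigma2, one shared correlation rho (assumed structure)."""
    return [[sigma2 if k == l else sigma2 * rho for l in range(n)] for k in range(n)]


def ar1_cov(n, sigma2, rho):
    """n x n covariance with entries sigma2 * rho**|k-l| (assumed AR(1)-type structure)."""
    return [[sigma2 * rho ** abs(k - l) for l in range(n)] for k in range(n)]


def moving_sum_cov(cov, w):
    """Covariance matrix of the moving sums S_i = X_i + ... + X_{i+w-1}.

    Works for any covariance matrix `cov` of the original variables:
    Cov(S_i, S_j) is the sum of cov[k][l] over the two windows.
    """
    n = len(cov)
    m = n - w + 1  # number of windows of size w
    return [
        [sum(cov[k][l] for k in range(i, i + w) for l in range(j, j + w))
         for j in range(m)]
        for i in range(m)
    ]


def common_closed_form(i, j, w, sigma2, rho):
    """Closed form under common correlation, derived for this sketch.

    Two windows share max(w - |i-j|, 0) variables; those pairs contribute
    sigma2 each, and the remaining w*w - overlap pairs contribute sigma2*rho.
    """
    overlap = max(w - abs(i - j), 0)
    return sigma2 * (overlap * (1 - rho) + w * w * rho)


if __name__ == "__main__":
    n, w, sigma2, rho = 8, 3, 2.0, 0.25
    ms = moving_sum_cov(common_cov(n, sigma2, rho), w)
    # The brute-force sum and the closed form should agree entry by entry.
    agree = all(
        abs(ms[i][j] - common_closed_form(i, j, w, sigma2, rho)) < 1e-12
        for i in range(len(ms)) for j in range(len(ms))
    )
    print("closed form matches brute force:", agree)
```

A tail probability for the maximum of the moving sums could then be approximated from this covariance matrix, for example with a multivariate normal or t distribution function; that step is omitted here because it requires a numerical library.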


62G32 Statistics of extreme values; tail inference
62J15 Paired and multiple comparisons; multiple testing

