Graph-based change-point detection. (English) Zbl 1308.62090

Summary: We consider the testing and estimation of change-points – locations where the distribution abruptly changes – in a data sequence. A new approach, based on scan statistics utilizing graphs representing the similarity between observations, is proposed. The graph-based approach is nonparametric, and can be applied to any data set as long as an informative similarity measure on the sample space can be defined. Accurate analytic approximations to the significance of graph-based scan statistics for both the single change-point and the changed interval alternatives are provided. Simulations reveal that the new approach has better power than existing approaches when the dimension of the data is moderate to high. The new approach is illustrated on two applications: The determination of authorship of a classic novel, and the detection of change in a network over time.


62G10 Nonparametric hypothesis testing
62H99 Multivariate analysis


Full Text: DOI arXiv Euclid


[1] Carlstein, E. G., Müller, H. G. and Siegmund, D. (1994). Change-Point Problems. Institute of Mathematical Statistics Lecture Notes-Monograph Series 23 . IMS, Hayward, CA.
[2] Chen, L. H. Y. and Shao, Q. M. (2005). Stein’s method for normal approximation. An Introduction to Stein’s Method 4 1-59.
[3] Chen, H. and Zhang, N. R. (2013). Graph-based tests for two-sample comparisons of categorical data. Statist. Sinica 23 1479-1503. · Zbl 1417.62155
[4] Chen, H. and Zhang, N. (2014). Supplement to “Graph-based change-point detection.” . · Zbl 1308.62090
[5] Cobb, G. W. (1978). The problem of the Nile: Conditional solution to a changepoint problem. Biometrika 65 243-251. · Zbl 0394.62074
[6] Cox, D. R. and Spjøtvoll, E. (1982). On partitioning means into groups. Scand. J. Stat. 9 147-152. · Zbl 0493.62065
[7] Desobry, F., Davy, M. and Doncarli, C. (2005). An online kernel change detection algorithm. IEEE Trans. Signal Process. 53 2961-2974. · Zbl 1370.94317
[8] Eagle, N., Pentland, A. S. and Lazer, D. (2009). Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. USA 106 15274-15278.
[9] Friedman, J. H. and Rafsky, L. C. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist. 7 697-717. · Zbl 0423.62034
[10] Girón, J., Ginebra, J. and Riba, A. (2005). Bayesian analysis of a multinomial sequence and homogeneity of literary style. Amer. Statist. 59 19-30. · Zbl 05680618
[11] Harchaoui, Z., Moulines, E. and Bach, F. R. (2009). Kernel change-point analysis. In Advances in Neural Information Processing Systems .
[12] James, B., James, K. L. and Siegmund, D. (1987). Tests for a change-point. Biometrika 74 71-83. · Zbl 0632.62021
[13] James, B., James, K. L. and Siegmund, D. (1992). Asymptotic approximations for likelihood ratio tests and confidence regions for a change-point in the mean of a multivariate normal distribution. Statist. Sinica 2 69-90. · Zbl 0822.62048
[14] Kossinets, G. and Watts, D. J. (2006). Empirical analysis of an evolving social network. Science 311 88-90. · Zbl 1226.91055
[15] Lung-Yut-Fong, A., Lévy-Leduc, C. and Cappé, O. (2011). Homogeneity and change-point detection tests for multivariate data using rank statistics. Preprint. Available at . · Zbl 1338.62134
[16] Olshen, A. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 557-572. · Zbl 1155.62478
[17] Radovanović, M., Nanopoulos, A. and Ivanović, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11 2487-2531. · Zbl 1242.62056
[18] Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 515-530. · Zbl 1095.62053
[19] Siegmund, D. (1988). Approximate tail probabilities for the maxima of some random fields. Ann. Probab. 16 487-501. · Zbl 0646.60032
[20] Siegmund, D. O. (1992). Tail approximations for maxima of random fields. In Probability Theory ( Singapore , 1989) 147-158. de Gruyter, Berlin. · Zbl 0758.60047
[21] Siegmund, D. and Yakir, B. (2007). The Statistics of Gene Mapping . Springer, New York. · Zbl 1280.62012
[22] Siegmund, D., Yakir, B. and Zhang, N. R. (2011). Detecting simultaneous variant intervals in aligned sequences. Ann. Appl. Stat. 5 645-668. · Zbl 1223.62166
[23] Srivastava, M. S. and Worsley, K. J. (1986). Likelihood ratio tests for a change in the multivariate normal mean. J. Amer. Statist. Assoc. 81 199-204. · Zbl 0589.62037
[24] Tang, H. K. and Siegmund, D. (2001). Mapping quantitative trait loci in oligogenic models. Biostatistics 2 147-162. · Zbl 1097.62570
[25] Tsirigos, A. and Rigoutsos, I. (2005). A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Res. 33 922-933.
[26] Tu, I.-P. and Siegmund, D. (1999). The maximum of a function of a Markov chain and application to linkage analysis. Adv. in Appl. Probab. 31 510-531. · Zbl 0941.60088
[27] Vostrikova, L. J. (1981). Detection of the disorder in multidimensional random-processes. Doklady Akademii Nauk SSSR 259 270-274.
[28] Woodroofe, M. (1976). Frequentist properties of Bayesian sequential tests. Biometrika 63 101-110. · Zbl 0341.62067
[29] Woodroofe, M. (1978). Large deviations of likelihood ratio statistics with applications to sequential testing. Ann. Statist. 6 72-84. · Zbl 0386.62019
[30] Worsley, K. J. (1986). Confidence regions and test for a change-point in a sequence of exponential family random variables. Biometrika 73 91-104. · Zbl 0589.62016
[31] Zhang, N. R., Siegmund, D. O., Ji, H. and Li, J. Z. (2010). Detecting simultaneous changepoints in multiple sequences. Biometrika 97 631-645. · Zbl 1195.62168
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.