
Significant motifs in time series. (English) Zbl 07260311

Summary: Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem in applications ranging from finance to health. Many algorithms have been proposed for finding motifs efficiently. Surprisingly, most of these proposals do not address how to evaluate the discovered motifs, which are typically assessed by human experts. This is infeasible even for moderately sized datasets, since the number of discovered motifs tends to be prohibitively large. Statistical significance tests are widely used in the data mining community to evaluate extracted patterns. In this work we present an approach to calculating the statistical significance of time series motifs. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif’s p-value. We estimate the expected frequency of a motif using Markov chain models. The p-value is then obtained by comparing the actual frequency with the estimated one using statistical hypothesis tests. Our contribution makes a powerful technique, statistical testing, applicable to a time series setting. This provides researchers and practitioners with an important tool for automatically evaluating the degree of relevance of each extracted motif.
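
The estimation step described in the summary can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the series has already been discretized into a string of symbols (for example by a SAX-like scheme), fits a first-order Markov chain to that string, and uses a Poisson upper tail as the hypothesis test. The function names, the model order, and the Poisson approximation are all assumptions made for this example.

```python
from collections import Counter
from math import exp


def count_occurrences(seq, word):
    """Observed number of (possibly overlapping) occurrences of word in seq."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)


def expected_count(seq, word):
    """Expected count of word under a first-order Markov chain fitted to seq.

    Assumption: the start probability is the empirical symbol frequency and
    transition probabilities are empirical adjacent-pair frequencies.
    """
    n, m = len(seq), len(word)
    symbols = Counter(seq)                 # how often each symbol occurs
    pairs = Counter(zip(seq, seq[1:]))     # how often each adjacent pair occurs
    starts = Counter(seq[:-1])             # symbols that have a successor
    prob = symbols[word[0]] / n
    for a, b in zip(word, word[1:]):
        if starts[a] == 0:
            return 0.0                     # transition never observed
        prob *= pairs[(a, b)] / starts[a]
    return (n - m + 1) * prob              # number of start positions times P(word)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via one minus the lower tail."""
    if k <= 0:
        return 1.0
    term = exp(-lam)                       # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i                    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def motif_p_value(seq, word):
    """Hypothetical helper: p-value for word being over-represented in seq."""
    observed = count_occurrences(seq, word)
    expected = expected_count(seq, word)
    return poisson_upper_tail(observed, expected)


if __name__ == "__main__":
    symbolic_series = "abcabcabcabdabcabc"   # made-up symbol string
    print(motif_p_value(symbolic_series, "abc"))
```

A small p-value suggests the motif occurs more often than the fitted Markov model predicts. When many motifs are tested at once, the resulting p-values would normally need a multiple-testing correction before any of them is declared significant.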

MSC:

62-XX Statistics
68-XX Computer science

Software:

iSAX

References:

[1] P. Ferreira, P. Azevedo, C. Silva, and R. Brito, Mining approximate motifs in time series, in Discovery Science, Secaucus, New Jersey, Springer, 2006, 89-101.
[2] J. Lin, E. Keogh, S. Lonardi, and P. Patel, Finding motifs in time series, In Proc. of the 2nd Workshop on Temporal Data Mining, Citeseer, 2002, 53-68.
[3] B. Chiu, E. Keogh, and S. Lonardi, Probabilistic discovery of time series motifs, In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003, 498.
[4] Y. Tanaka, K. Iwamoto, and K. Uehara, Discovery of time-series motif from multi-dimensional data based on MDL principle, Mach Learn 58(2) (2005), 269-300. · Zbl 1075.62084
[5] T. Oates, PERUSE: an unsupervised algorithm for finding recurring patterns in time series, IEEE ICDM 2 (2002), 5.
[6] D. Yankov, E. Keogh, J. Medina, B. Chiu, and V. Zordan, Detecting time series motifs under uniform scaling, In Proceedings of the 13th ACM SIGKDD international
[7] D. Minnen, T. Starner, I. Essa, and C. Isbell, Improving activity discovery with automatic neighborhood estimation, Proceedings of the 20th International Joint Conference on Artificial Intelligence, San Francisco, California, Morgan Kaufmann Publishers Inc., 2007, 2814-2819.
[8] F. Mörchen and A. Ultsch, Efficient mining of understandable patterns from multivariate interval time series, Data Min Knowl Discov 15(2) (2007), 181-215.
[9] A. Mueen, E. Keogh, Q. Zhu, S. Cash, and B. Westover, Exact discovery of time series motifs, In Proceedings of the Ninth SIAM International Conference on Data Mining (SDM), 2009, 473-484.
[10] A. Mueen and E. Keogh, Online discovery and maintenance of time series motifs, In Proceedings of the sixteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, 1089-1098.
[11] A. Mueen, E. Keogh, and N. Bigdely-Shamlo, Finding time series motifs in disk-resident data, In 2009 Ninth IEEE International Conference on Data Mining, 2009, 367-376.
[12] N. Castro and P. Azevedo, Multiresolution motif discovery in time series, In Proceedings of the Tenth SIAM International Conference on Data Mining, 2010, 665-676.
[13] E. Keogh and T. Folias, The UCR Time Series Data Mining Archive, Riverside CA, University of California-Computer Science & Engineering Department, 2002.
[14] S. Robin, S. Schbath, and V. Vandewalle, Statistical tests to compare motif count exceptionalities, BMC Bioinformatics 8(1) (2007), 84.
[15] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Network motifs: simple building blocks of complex networks, Science 298(5594) (2002), 824.
[16] G. Webb, Discovering significant patterns, Mach Learn 68(1) (2007), 1-33. · Zbl 1470.68195
[17] P. G. Ferreira and P. J. Azevedo, Evaluating protein motif significance measures: a case study on prosite patterns, In IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), 2007, 171-178.
[18] J. Zhang, B. Jiang, M. Li, J. Tromp, X. Zhang, and M. Zhang, Computing exact P-values for DNA motifs, Bioinformatics 23(5) (2007), 531.
[19] T. Marschall and S. Rahmann, Efficient exact motif discovery, Bioinformatics 25(12) (2009), i356.
[20] G. Nuel, Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics, Algorithms Mol Biol 1(1) (2006), 5.
[21] V. Boeva, J. Clément, M. Régnier, M. Roytberg, and V. Makeev, Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules, Algorithms Mol Biol 2(1) (2007), 13.
[22] C. Low Kam, A. Mas, and M. Teisseire, Mining for unexpected sequential patterns given a Markov model, 2008. http://www.math.univ-montp2.fr/~mas/lmt_siam09.pdf.
[23] J. Hollunder, M. Friedel, A. Beyer, C. Workman, and T. Wilhelm, DASS: efficient discovery and p-value calculation of substructures in unordered data, Bioinformatics 23(1) (2007), 77.
[24] S. Robin and S. Schbath, Numerical comparison of several approximations of the word count distribution in random sequences, J Comput Biol 8(4) (2001), 349-359.
[25] M. Régnier and M. Vandenbogaert, Comparison of statistical significance criteria, J Bioinformatics Comput Biol 4(2) (2006), 537-552.
[26] S. Schbath, An overview on the distribution of word counts in Markov chains, J Comput Biol 7(1-2) (2000), 193-201.
[27] H. He and A. Singh, Graphrank: statistical modeling and mining of significant subgraphs in the feature space, In Sixth International Conference on Data Mining (ICDM’06), 2006, 885-890.
[28] S. Jacquemont, F. Jacquenet, and M. Sebban, Mining probabilistic automata: a statistical view of sequential pattern mining, Mach Learn 75(1) (2009), 91-127. · Zbl 1470.68119
[29] P. Ribeca and E. Raineri, Faster exact Markovian probability functions for motif occurrences: a DFA-only approach, Bioinformatics 24(24) (2008), 2839.
[30] C. Matias, S. Schbath, E. Birmelé, J. Daudin, and S. Robin, Network motifs: mean and variance for the count, REVSTAT Stat J 4(1) (2006), 31-51. · Zbl 1158.62346
[31] F. Picard, J. Daudin, M. Koskas, S. Schbath, and S. Robin, Assessing the exceptionality of network motifs, J Comput Biol 15(1) (2008), 1-20.
[32] E. Keogh, S. Lonardi, and B. Chiu, Finding surprising patterns in a time series database in linear time and space, In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, 550-556.
[33] E. Keogh and S. Kasetty, On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Min Knowl Discov 7(4) (2003), 349-371.
[34] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh, Querying and mining of time series data: experimental comparison of representations and distance measures, Proc VLDB Endowment 1(2) (2008), 1542-1552.
[35] J. Shieh and E. Keogh, iSAX: indexing and mining terabyte sized time series, In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, 623-631.
[36] S. Schbath, Statistics of motifs, Atelier de Form 1502 (2006).
[37] S. Robin, F. Rodolphe, and S. Schbath, DNA, Words and Models, New York, Cambridge Univ. Press, 2005. · Zbl 1185.92047
[38] S. Holm, A simple sequentially rejective multiple test procedure, Scand J Stat 6(2) (1979), 65-70. · Zbl 0402.62058
[39] S. Hanhijärvi, K. Puolamäki, and G. Garriga, Multiple hypothesis testing in pattern discovery, STAT 1050 (2009), 29.
[40] J. Storey and R. Tibshirani, Statistical significance for genomewide studies, Proc Natl Acad Sci USA 100(16) (2003), 9440. · Zbl 1130.62385
[41] S. Hanhijärvi, K. Puolamäki, and G. Garriga, Multiple hypothesis testing in pattern discovery, Arxiv preprint arXiv:0906.5263, 2009.
[42] Y. Benjamini and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J R Stat Soc B 57(1) (1995), 289-300. · Zbl 0809.62014
[43] Y. Benjamini and M. Leshno, Statistical methods for data mining, Data Mining and Knowledge Discovery Handbook, Tel-Aviv, Israel, Springer, 2005, 565-587.
[44] H. Zhang, B. Padmanabhan, and A. Tuzhilin, On the discovery of significant statistical quantitative rules, In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, 374-383.
[45] S. Santosh Bangalore, J. Wang, and D. Allison, How accurate are the extremely small p-values used in genomic research: an evaluation of numerical libraries, Comput Stat Data Anal 53(7) (2009), 2446-2452. · Zbl 1279.62022
[46] E. Keogh, J. Lin, and A. Fu, HOT SAX: efficiently finding the most unusual time series subsequence, In Proceedings of the Fifth IEEE International Conference on Data Mining, IEEE Computer Society, 2005, 233.
[47] N. Castro, Time series motifs statistical significance website. http://www.di.uminho.pt/~castro/stat.
[48] Y.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.