×

Mining approximate patterns with frequent locally optimal occurrences. (English) Zbl 1329.05017

Summary: We consider a frequent approximate pattern mining problem, in which interspersed repetitive regions are extracted from a given string. That is, we enumerate substrings that frequently match substrings of a given string locally and optimally. For this problem, we propose a new algorithm, in which candidate patterns are generated without duplication using the suffix tree of a given string. We further define a \(k\)-gap-constrained setting, in which the number of gaps in the alignment between a pattern and an occurrence is limited to at most \(k\). Under this setting, we present memory-efficient algorithms, particularly a candidate-based version, which runs fast enough even over human chromosome sequences with more than 10 million nucleotides. We note that our problem and algorithms for strings can be directly extended to ordered labeled trees. In our experiments we used both randomly synthesized strings, in which corrupted similar substrings are embedded, and real data of human chromosome. The synthetic data experiments show that our proposed approach extracted embedded patterns correctly and time-efficiently. In real data experiments, we examined the centers of 100 clusters computed after grouping the patterns obtained by our \(k\)-gap-constrained versions (\(k = 0, 1 \mathrm{ and } 2\)) and the results revealed that the regions of their occurrences coincided with around a half of the regions automatically annotated as Alu sequences by a manually curated repeat sequence database.

MSC:

05A15 Exact enumeration problems, generating functions
68T10 Pattern recognition, speech recognition
PDFBibTeX XMLCite
Full Text: DOI

References:

[2] Altschul, S.; Gish, V.; Miller, W.; Myers, E.; Lipman, D., Basic local alignment search tool, J. Mol. Biol., 215, 403-410 (1990)
[3] Becher, V.; Deymonnaz, A.; Heiber, P., Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome, Bioinfomatics, 25, 14, 1746-1753 (2009)
[4] Bergman, C. M.; Quesneville, H., Discovering and detecting transposable elements in genome sequences, Brief Bioinform., 8, 6, 382-392 (2007)
[5] Bao, Z.; Eddy, S. R., Automated de novo identification of repeat sequence families in sequenced genomes, Genome Res., 12, 8, 1269-1276 (2002)
[6] Cheng, J.; Ke, Y.; Ng, W., \( \delta \)-tolerance closed frequent itemsets, (Proceedings of the 6th IEEE International Conference on Data Mining, ICDM06 (2006), IEEE Press), 139-148
[7] Demaine, E. D.; Mozes, S.; Rossman, B.; Weimann, O., An optimal decomposition algorithm for tree edit distance, ACM Trans. Algorithms, 6, 1 (2009), Article No. 2 · Zbl 1300.68057
[8] Downey, P. J.; Sethi, R.; Tarjan, R. E., Variations on the common subexpression problem, J. ACM, 27, 4, 758-771 (1980) · Zbl 0458.68026
[9] Edgar, R. C.; Myers, E. W., Piler: identification and classification of genomic repeats, Bioinformatics, 21, Suppl 1, i152-i158 (2005)
[10] Erickson, B. W.; Sellers, P. H., Recognition of patterns in genetic sequences, (Sankoff, D.; Kruskal, J. B., Time Warps, String Edits, and Macromolecules : The Theory and Practice of Sequence Comparison (1983), Addison-Wesley: Addison-Wesley Reading, MA), 55-91, (Chapter 2)
[11] Floratou, A.; Tata, S.; Patel, J. M., Efficient and accurate discovery of patterns in sequence data sets, IEEE Trans. Knowl. Data Eng., 23, 1154-1168 (2011)
[12] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997), Cambridge University Press: Cambridge University Press New York · Zbl 0934.68103
[13] Gusfield, D.; Stoye, J., Linear time algorithms for finding and representing all the tandem repeats in a string, J. Comput. System Sci., 69, 525-546 (2004) · Zbl 1076.68111
[14] Jurka, J.; Kapitonov, V.; Pavlicek, A.; Klonowski, P.; Kohany, O.; Walichiewicz, J., Repbase update, a database of eukaryotic repetitive elements, Cytogenet. Genome Res., 110, 462-467 (2005)
[15] Kurtz, S.; Choudhuri, J. V.; Ohlebusch, E.; Schleiermacher, C.; Stoye, J.; Giegerich, R., Reputer: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Res., 29, 22, 4633-4642 (2001)
[16] Li, C.; Yang, Q.; Wang, J.; Li, M., Efficient mining of gap-constrained subsequences and its various applications, ACM Trans.Knowl. Discov. Data, 6 (2012), Article No. 2
[17] Li, R.; Ye, J.; Li, S.; Wang, J.; Han, Y.; Ye, C.; Wang, J.; Yang, H.; Yu, J.; Wong, G. K.-S.; Wang, J., Reas: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun, PLoS Comput. Biol., 1, 4, e43 (2005)
[18] Luxburg, U. V., A Tutorial on Spectral Clustering (2007)
[19] Ng, A. Y.; Jordan, M. I.; Weiss, Y., On spectral clustering: Analysis and an algorithm, (Advances in Neural Information Processing Systems (2001), MIT Press), 849-856
[21] Price, A. L.; Jones, N. C.; Pevzner, P. A., De novo identification of repeat families in large genomes, Bioinformatics, 21, Suppl 1, i351-i358 (2005)
[22] Quesneville, H.; Nouaud, D.; Anxolabéhère, D., Detection of new transposable element families in drosophila melanogaster and anopheles gambiae genomes, J. Mol. Evol., 57, 1, S50-S59 (2003)
[23] Schmidt, J. P., All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings, SIAM J. Comput., 27, 4, 972-992 (1998) · Zbl 0907.68076
[25] Tai, K.-C., The tree-to-tree correction problem, J. Assoc. Comput. Mach., 26, 3, 422-433 (1979) · Zbl 0409.68040
[26] Takigawa, I.; Mamitsuka, H., Efficiently mining \(\delta \)-tolerance closed frequent subgraphs, Mach. Learn., 82, 2, 95-121 (2011) · Zbl 1237.68168
[27] Tan, P.-N.; Steinbach, M.; Kumar, V., Introduction to Data Mining (2005), Addison-Wesley Longman Publishing Co., Inc.: Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA
[28] Volfovsky, N.; Haas, B. J.; Salzberg, S. L., A clustering method for repeat analysis in dna sequences, Genome Biol., 2, 8 (2001), research0027.1-research0027.11
[29] Wang, J.; Han, J.; Li, C., Frequent closed sequence mining without candidate maintenance, IEEE Trans. Knowl. Data Eng., 19, 1042-1056 (2007)
[32] Zaki, M. J., Spade: An effcient algorithm for mining frequent sequences, Mach. Learn., 42, 31-60 (2001) · Zbl 0970.68052
[33] Zhang, K., Algorithms for the constrained editing distance between ordered labeled trees and related problems, Pattern Recognit., 28, 3, 463-474 (1995)
[34] Zhang, M.; Kao, B.; Cheung, D. W.; Yip, K. Y., Mining periodic patterns with gap requirement from sequence, ACM Trans.Knowl. Discov. Data, 1 (2007), Article No. 7
[36] Zhu, F.; Yan, X.; Yu, P. S.; Han, J., Mining frequent approximate sequential patterns, (Kargupta, H.; Han, J.; Yu, P. S.; Motwani, R.; Kumar, V., Next Generation of Data Mining (2008), Chapman and Hall/CRC), 69-89
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.