×

Simple and flexible detection of contiguous repeats using a suffix tree. (English) Zbl 0988.68058

Summary: We study the problem of detecting all occurrences of (primitive) tandem repeats and tandem arrays in a string. We first give a simple time- and space-optimal algorithm to find all tandem repeats, and then modify it to become a time and space-optimal algorithm for finding only the primitive tandem repeats. Both of these algorithms are then extended to handle tandem arrays. The contribution of this paper is both pedagogical and practical, giving simple algorithms and implementations based on a suffix tree, using only standard tree traversal techniques.

MSC:

68P10 Searching and sorting

Software:

REPuter
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Apostolico, A., The myriad virtues of subword trees, (Apostolico, A.; Galil, Z., Combinatorial Algorithms on Words, vol. F12 of NATO ASI Series (1985), Springer: Springer Berlin), 85-96 · Zbl 0572.68067
[2] Apostolico, A.; Preparata, F. P., Optimal off-line detection of repetitions in a string, Theoret. Comput. Sci., 22, 297-315 (1983) · Zbl 0497.68052
[3] Cao, J. X.; Krell, P. J.; Nagy, E., Sequence and transcriptional analysis of terminal regions of the fowl adenovirus type 8 genome, J. Gen. Virol., 79, Pt10, 2507-2516 (1998)
[4] Crochemore, M., An optimal algorithm for computing the repetitions in a word, Inform. Process. Lett., 12, 5, 244-250 (1981) · Zbl 0467.68075
[5] Crochemore, M.; Rytter, W., Periodic prefixes in texts, (Capocelli, R.; De Santis, A.; Vaccaro, U., Sequences II (1993), Springer: Springer Berlin), 153-165 · Zbl 0960.68755
[6] Crochemore, M.; Rytter, W., Text Algorithms (1994), Oxford University Press: Oxford University Press New York, NY · Zbl 0844.68101
[7] Crochemore, M.; Rytter, W., Squares, cubes, and time-space efficient string searching, Algorithmica, 13, 5, 405-425 (1995) · Zbl 0849.68044
[8] Farach, M., Optimal suffix tree construction with large alphabets, (Proc. 38th Ann. Symp. on Foundations of Computer Science, FOCS 97 (1997), IEEE Press: IEEE Press New York, NY), 137-143
[9] Fraenkel, A. S.; Simpson, J., How many squares can a string contain?, J. Combin. Theory Ser. A, 82, 112-120 (1998) · Zbl 0910.05001
[10] Fraenkel, A. S.; Simpson, J., The exact number of squares in Fibonacci words, Theoret. Comput. Sci., 218, 1, 95-106 (1999) · Zbl 0916.68122
[11] Fraenkel, A. S.; Simpson, R. J., How many squares must a binary sequence contain?, Electron. J. Combin., 2, #R2, 1-9 (1995) · Zbl 0816.11007
[12] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997), Cambridge University Press: Cambridge University Press New York, NY · Zbl 0934.68103
[13] D. Gusfield, J. Stoye, Linear time algorithms for finding and representing all the tandem repeats in a string, Report CSE-98-4, Department of Computer Science, University of California, Davis, 1998.; D. Gusfield, J. Stoye, Linear time algorithms for finding and representing all the tandem repeats in a string, Report CSE-98-4, Department of Computer Science, University of California, Davis, 1998. · Zbl 1076.68111
[14] Iliopoulos, C. S.; Moore, D.; Smyth, W. F., A characterization of the squares in a Fibonacci string, Theoret. Comput. Sci., 172, 1-2, 281-291 (1997) · Zbl 0903.68050
[15] R.W. Irving, Personal communication.; R.W. Irving, Personal communication.
[16] Kolpakov, R.; Kucherov, G., Finding maximal repetitions in a word in linear time, (Proc. 40th Annu. Symp. on Foundations of Computer Science FOCS 99 (1999), IEEE Press: IEEE Press New York), 596-604
[17] Kolpakov, R.; Kucherov, G., On maximal repetitions in words, (Ciobanu, G.; Paun, G., Proc. 12th Int. Symp. on Fundamentals of Computation Theory, FCT 99, vol. 1684 of Lecture Notes in Computer Science (1999), Springer: Springer Berlin), 374-385 · Zbl 0948.68139
[18] R. Kolpakov, G. Kucherov, On the sum of exponents of maximal repetitions in a word, Technical Report 99-R-034, LORIA, France, 1999.; R. Kolpakov, G. Kucherov, On the sum of exponents of maximal repetitions in a word, Technical Report 99-R-034, LORIA, France, 1999. · Zbl 0948.68139
[19] Kosaraju, S. R., Computation of squares in a string, (Crochemore, M.; Gusfield, D., Proc. Fifth Ann. Symp. on Combinatorial Pattern Matching, CPM 94, vol. 807 of Lecture Notes in Computer Science (1994), Springer: Springer Berlin), 146-150
[20] Kurtz, S.; Schleiermacher, C., REPuter: Fast computation of maximal repeats in complete genomes, Bioinformatics, 15, 5, 426-427 (1999)
[21] G.M. Landau, Personal communication.; G.M. Landau, Personal communication.
[22] Landau, G. M.; Schmidt, J. P., An algorithm for approximate tandem repeats, (Apostolico, A.; Crochemore, M.; Galil, Z.; Manber, U., Proc. Fourth Ann. Symp. on Combinatorial Pattern Matching, CPM 93, vol. 684 of Lecture Notes in Computer Science (1993), Springer: Springer Berlin), 120-133
[23] Main, M. G.; Lorentz, R. J., An \(O(n log n)\) algorithm for finding all repetitions in a string, J. Algorithms, 5, 422-432 (1984) · Zbl 0547.68083
[24] Main, M. G.; Lorentz, R. J., Linear time recognition of squarefree strings, (Apostolico, A.; Galil, Z., Combinatorial Algorithms on Words, vol. F12 of NATO ASI Series (1985), Springer: Springer Berlin), 271-278 · Zbl 0572.68068
[25] Manber, U.; Myers, E. W., Suffix arraysA new method for on-line search, SIAM J. Comput., 22, 935-948 (1993) · Zbl 0784.68027
[26] Martinez, H. M., An efficient method for finding repeats in molecular sequences, Nucleic Acids Res., 11, 13, 4629-4634 (1983)
[27] McCreight, E. M., A space-economical suffix tree construction algorithm, J. Assoc. Comput. Mach., 23, 2, 262-272 (1976) · Zbl 0329.68042
[28] J.P. Schmidt, Personal communication.; J.P. Schmidt, Personal communication.
[29] P.F. Stelling, Applications of Combinatorial Analysis to Repetitions in Strings, Phylogeny, and Parallel Multiplier Design. Ph. D. Dissertation, Department of Computer Science, University of California, Davis, 1995.; P.F. Stelling, Applications of Combinatorial Analysis to Repetitions in Strings, Phylogeny, and Parallel Multiplier Design. Ph. D. Dissertation, Department of Computer Science, University of California, Davis, 1995.
[30] Stoye, J.; Gusfield, D., Simple and flexible detection of contiguous repeats using a suffix tree, (Farach, M., Proc. 9th Ann. Symp. on Combinatorial Pattern Matching, CPM 98, vol. 1448 of Lecture Notes in Computer Science (1998), Springer: Springer Berlin), 140-152
[31] Trifonov, E. N., Elucidating sequence codes: Three codes for evolution, Ann. N. Y. Acad. Sci., 870, 330-338 (1999)
[32] Tsunoda, T.; Fukagawa, M.; Takagi, T., Time and memory efficient algorithm for extracting palindromic and repetitive subsequences in nucleic acid sequences, (Proc. Fourth Pacific Symp. on Biocomputing, PSB 99 (1999), IEEE Press: IEEE Press New York), 202-213
[33] Ukkonen, E., On-line construction of suffix trees, Algorithmica, 14, 249-260 (1995) · Zbl 0831.68027
[34] Weiner, P., Linear pattern matching algorithms, (IEEE 14th Ann. Symp. on Switching and Automata Theory (1973), IEEE Press: IEEE Press New York), 1-11
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.