×

Space-time trade-offs for finding shortest unique substrings and maximal unique matches. (English) Zbl 1379.68373

Summary: Given a string \(\mathsf{X}[1,n]\) and a position \(k\in [1,n]\), a Shortest Unique Substring of \(\mathsf{X}\) covering \(k\), denoted by \(\mathsf{S}_k\), is a substring \(\mathsf{X}[i,j]\) of \(\mathsf{X}\) which satisfies the following conditions: (i) \(i\leq k\leq j\), (ii) \(i\) is the only position where there is an occurrence of \(\mathsf{X}[i,j]\), and (iii) \(j-i\) is minimized. Current best-known algorithms for finding \(\mathsf{S}_k\) require \(\Theta(n)\) words of working space, and \(O(n)\) time. Let \(\tau\) be a given parameter. We present the following new results.
Given a \(k\in [1,n]\), we can compute \(\mathsf{S}_k\) in \(O(n\tau^2\log\frac{n}{\tau})\) time using \(\mathsf{X}\) and an additional \(O(n/\tau)\) words of working space.
For every \(k\in [1,n]\), we can compute \(\mathsf{S}_k\) in \(O(n\tau^2\log\frac{n}{\tau})\) time using \(\mathsf{X}\), and an additional \(O(n/\tau)\) words and \(4n+o(n)\) bits of working space.
We present an \(O(n\tau\log^{c+1}n)\)-time randomized algorithm that uses \(n/\log^cn\) words in addition to that mentioned above, where \(c\geq 0\) is an arbitrary constant. In this case, the reported string is unique and covers \(k\), but, with probability at most \(n^{-O(1)}\), may not be the shortest.
By choosing \(\tau=\omega(1)\), our results imply the first sub-linear space (in addition to the input string) solution to these problems. We also present the following two results.
An algorithm that finds \(\mathsf{S}_k\) for every \(k\in [1,n]\) using \(O(n\log\sigma)\) bits of working space in \(O(n\log n)\) time, where \(\sigma\leq n\) is the number of distinct symbols in \(\mathsf{X}\).
A \(4n+o(n)\)-bit index that can report \(\mathsf{S}_k\) for any \(k\) in \(O(1)\) time.
As a consequence of our techniques, we also obtain similar space-and-time tradeoffs for a related problem of finding Maximal Unique Matches of two strings [A. L. Delcher et al., “Alignment of whole genomes”, Nucleic Acids Res. 27, No. 11, 2369–2376 (1999)].

MSC:

68W32 Algorithms on strings
68P05 Data structures
68Q25 Analysis of algorithms and problem complexity
68W20 Randomized algorithms

Software:

MUMMER
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Weiner, P., Linear pattern matching algorithms, (14th Annual Symposium on Switching and Automata Theory. 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973 (1973)), 1-11
[2] Manber, U.; Myers, E. W., Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22, 5, 935-948 (1993) · Zbl 0784.68027
[3] Gusfield, D., Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology (1997), Cambridge University Press · Zbl 0934.68103
[4] Pei, J.; Wu, W. C.; Yeh, M., On shortest unique substring queries, (29th IEEE International Conference on Data Engineering. 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013 (2013)), 937-948
[5] Haubold, B.; Pierstorff, N.; Möller, F.; Wiehe, T., Genome comparison without alignment using shortest unique substrings, BMC Bioinform., 6, 1, 1-11 (2005)
[6] Derrien, T.; Estellé, J.; Marco Sola, S.; Knowles, D. G.; Raineri, E.; Guigó, R.; Ribeca, P., Fast computation and applications of genome mappability, PLoS ONE, 7, 1, 1-16 (2012)
[7] Adas, B.; Bayraktar, E.; Faro, S.; Moustafa, I. E.; Külekci, M. O., Nucleotide sequence alignment and compression via shortest unique substring, (Bioinformatics and Biomedical Engineering - Third International Conference. Proceedings, Part II. Bioinformatics and Biomedical Engineering - Third International Conference. Proceedings, Part II, IWBBIO 2015, Granada, Spain, April 15-17, 2015 (2015)), 363-374
[8] Ileri, A. M.; Külekci, M. O.; Xu, B., Shortest unique substring query revisited, (Combinatorial Pattern Matching - 25th Annual Symposium. Proceedings. Combinatorial Pattern Matching - 25th Annual Symposium. Proceedings, CPM 2014, Moscow, Russia, June 16-18, 2014 (2014)), 172-181 · Zbl 1409.68354
[9] Tsuruta, K.; Inenaga, S.; Bannai, H.; Takeda, M., Shortest unique substrings queries in optimal time, (SOFSEM 2014: Theory and Practice of Computer Science - 40th International Conference on Current Trends in Theory and Practice of Computer Science. Proceedings. SOFSEM 2014: Theory and Practice of Computer Science - 40th International Conference on Current Trends in Theory and Practice of Computer Science. Proceedings, Nový Smokovec, Slovakia, January 26-29, 2014 (2014)), 503-513 · Zbl 1432.68612
[10] Hon, W.; Thankachan, S. V.; Xu, B., An in-place framework for exact and approximate shortest unique substring queries, (Algorithms and Computation - 26th International Symposium. Proceedings. Algorithms and Computation - 26th International Symposium. Proceedings, ISAAC 2015, Nagoya, Japan, December 9-11, 2015 (2015)), 755-767 · Zbl 1472.68224
[11] Belazzougui, D.; Cunial, F., Indexed matching statistics and shortest unique substrings, (String Processing and Information Retrieval - 21st International Symposium. Proceedings. String Processing and Information Retrieval - 21st International Symposium. Proceedings, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014 (2014)), 179-190
[12] Hu, X.; Pei, J.; Tao, Y., Shortest unique queries on strings, (String Processing and Information Retrieval - 21st International Symposium. Proceedings. String Processing and Information Retrieval - 21st International Symposium. Proceedings, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014 (2014)), 161-172
[13] Mieno, T.; Inenaga, S.; Bannai, H.; Takeda, M., Shortest unique substring queries on run-length encoded strings, (41st International Symposium on Mathematical Foundations of Computer Science. 41st International Symposium on Mathematical Foundations of Computer Science, MFCS 2016, August 22-26, 2016 - Kraków, Poland (2016)), 69:1-69:11 · Zbl 1398.68109
[14] Delcher, A. L.; Kasif, S.; Fleischmann, R. D.; Peterson, J.; White, O.; Salzberg, S. L., Alignment of whole genomes, Nucleic Acids Res., 27, 11, 2369-2376 (1999)
[15] Delcher, A. L.; Phillippy, A.; Carlton, J.; Salzberg, S. L., Fast algorithms for large-scale genome alignment and comparison, Nucleic Acids Res., 30, 11, 2478-2483 (2002)
[16] Kurtz, S., Reducing the space requirement of suffix trees, Softw., Pract. Exp., 29, 13, 1149-1171 (1999)
[17] Kurtz, S.; Phillippy, A.; Delcher, A. L.; Smoot, M.; Shumway, M.; Antonescu, C.; Salzberg, S. L., Versatile and open software for comparing large genomes, Genome Biol., 5, 2, R12 (2004)
[18] Hon, W.; Sadakane, K., Space-economical algorithms for finding maximal unique matches, (Combinatorial Pattern Matching, 13th Annual Symposium. Proceedings. Combinatorial Pattern Matching, 13th Annual Symposium. Proceedings, CPM 2002, Fukuoka, Japan, July 3-5, 2002 (2002)), 144-152 · Zbl 1077.68946
[19] McCreight, E. M., A space-economical suffix tree construction algorithm, J. ACM, 23, 2, 262-272 (1976) · Zbl 0329.68042
[20] Ukkonen, E., On-line construction of suffix trees, Algorithmica, 14, 3, 249-260 (1995) · Zbl 0831.68027
[21] Hon, W.; Lam, T. W.; Shah, R.; Tam, S.; Vitter, J. S., Compressed index for dictionary matching, (2008 Data Compression Conference. 2008 Data Compression Conference, DCC 2008, 25-27 March 2008, Snowbird, UT, USA (2008)), 23-32
[22] Farach, M., Optimal suffix tree construction with large alphabets, (38th Annual Symposium on Foundations of Computer Science. 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997 (1997)), 137-143
[23] Bender, M. A.; Farach-Colton, M., The level ancestor problem simplified, (LATIN 2002: Theoretical Informatics, 5th Latin American Symposium. Proceedings. LATIN 2002: Theoretical Informatics, 5th Latin American Symposium. Proceedings, Cancun, Mexico, April 3-6, 2002 (2002)), 508-515 · Zbl 1059.68563
[24] Bender, M. A.; Farach-Colton, M.; Pemmasani, G.; Skiena, S.; Sumazin, P., Lowest common ancestors in trees and directed acyclic graphs, J. Algorithms, 57, 2, 75-94 (2005) · Zbl 1085.68103
[25] Munro Tables, J. I., (Foundations of Software Technology and Theoretical Computer Science, 16th Conference. Proceedings. Foundations of Software Technology and Theoretical Computer Science, 16th Conference. Proceedings, Hyderabad, India, December 18-20, 1996 (1996)), 37-42
[26] Fischer, J.; Heun, V., A new succinct representation of rmq-information and improvements in the enhanced suffix array, (Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, First International Symposium. Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, First International Symposium, ESCAPE 2007, Hangzhou, China, April 7-9, 2007 (2007)), 459-470, Revised Selected Papers · Zbl 1176.68058
[27] Karp, R. M.; Rabin, M. O., Efficient randomized pattern-matching algorithms, IBM J. Res. Dev., 31, 2, 249-260 (1987) · Zbl 0653.68054
[28] Fredman, M. L.; Komlós, J.; Szemerédi, E., Storing a sparse table with o(1) worst case access time, J. ACM, 31, 3, 538-544 (1984) · Zbl 0629.68068
[29] Clifford, R.; Fontaine, A.; Porat, E.; Sach, B.; Starikovskaya, T. A., Dictionary matching in a stream, (Algorithms - ESA 2015 - 23rd Annual European Symposium. Proceedings. Algorithms - ESA 2015 - 23rd Annual European Symposium. Proceedings, Patras, Greece, September 14-16, 2015 (2015)), 361-372 · Zbl 1443.68218
[30] Belazzougui, D.; Boldi, P.; Pagh, R.; Vigna, S., Monotone minimal perfect hashing: searching a sorted table with \(O(1)\) accesses, (Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms. Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009 (2009)), 785-794 · Zbl 1423.68132
[31] Ferragina, P.; Manzini, G., Indexing compressed text, J. ACM, 52, 4, 552-581 (2005) · Zbl 1323.68261
[32] Grossi, R.; Vitter, J. S., Compressed suffix arrays and suffix trees with applications to text indexing and string matching, SIAM J. Comput., 35, 2, 378-407 (2005) · Zbl 1092.68115
[33] Hon, W.; Sadakane, K.; Sung, W., Breaking a time-and-space barrier in constructing full-text indices, SIAM J. Comput., 38, 6, 2162-2178 (2009) · Zbl 1191.68225
[34] Sadakane, K., Compressed suffix trees with full functionality, Theory Comput. Syst., 41, 4, 589-607 (2007) · Zbl 1148.68015
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.