×

Space-efficient construction of Lempel-Ziv compressed text indexes. (English) Zbl 1220.68051

Summary: A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text \(T[1..u]\) over an alphabet of size \(\sigma\), the LZ-index requires \(4|LZ|(1+o(1))\) bits of space, where \(|LZ|\) is the size of the LZ78-compression of \(T\). This can be bounded by \(|LZ|=uH_k(T)+o(u\log \sigma)\), where \(H_{k}(T)\) is the \(k\)-th order empirical entropy of \(T\), for any \(k=o(\log_\sigma u)\).
The LZ-index is built in \(O(u\log \sigma)\) time, yet requiring \(O(u\log u)\) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0–1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in \(O(u(\log \sigma+\log\log u))\) time and requiring \(4|LZ|(1+o(1))\) bits of main memory, that is, asymptotically the same space of the final index.
We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times \(|LZ|(1+o(1))\) bits, and show that these can also be built using asymptotically the same space of the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in \(O(u(\log \sigma+\log\log u))\) time, and within \(|LZ|(1+o(1))\) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.

MSC:

68P05 Data structures
68P15 Database theory
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)

Software:

PATRICIA
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Apostolico, A., The myriad virtues of subword trees, (Combinatorial Algorithms on Words. Combinatorial Algorithms on Words, NATO ISI Series (1985), Springer-Verlag), 85-96 · Zbl 0572.68067
[2] Arroyuelo, D., An improved succinct representation for dynamic \(k\)-ary trees, (Proc. 19th Annual Symposium on Combinatorial Pattern Matching (CPM). Proc. 19th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS, vol. 5029 (2008)), 277-289 · Zbl 1143.68381
[3] Arroyuelo, D.; Cánovas, R.; Navarro, G.; Sadakane, K., Succinct trees in practice, (Proc. 11th Workshop on Algorithm Engineering and Experiments (ALENEX) (2010)), 84-97 · Zbl 1429.68044
[4] Arroyuelo, D.; Navarro, G., Space-efficient construction of LZ-index, (Proc. 16th Annual International Symposium on Algorithms and Computation (ISAAC). Proc. 16th Annual International Symposium on Algorithms and Computation (ISAAC), LNCS, vol. 3827 (2005), Springer), 1143-1152 · Zbl 1175.68117
[5] Arroyuelo, D.; Navarro, G., Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices, ACM Journal of Experimental Algorithmics, 15, 1.5 (2010) · Zbl 1284.68253
[6] Arroyuelo, D.; Navarro, G.; Sadakane, K., Reducing the space requirement of LZ-index, (Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS, vol. 4009 (2006)), 319-330 · Zbl 1196.68076
[7] D. Arroyuelo, G. Navarro, K. Sadakane, Stronger Lempel-Ziv based compressed text indexing, Algorithmica (2011), doi:10.1007/s00453-010-9443-8; D. Arroyuelo, G. Navarro, K. Sadakane, Stronger Lempel-Ziv based compressed text indexing, Algorithmica (2011), doi:10.1007/s00453-010-9443-8 · Zbl 1241.68061
[8] Benoit, D.; Demaine, E.; Munro, J. I.; Raman, R.; Raman, V.; Rao, S. S., Representing trees of higher degree, Algorithmica, 43, 4, 275-292 (2005) · Zbl 1086.68034
[9] Brodnik, A.; Carlsson, S.; Demaine, E.; Munro, J. I.; Sedgewick, R., Resizable arrays in optimal time and space, (Proc. WADS. Proc. WADS, LNCS, vol. 1663 (1999), Springer), 37-48 · Zbl 1063.68572
[10] M. Burrows, D.J. Wheeler, A block-sorting lossless data compression algorithm, Tech. Rep. 124, Digital Equipment Corporation, 1994.; M. Burrows, D.J. Wheeler, A block-sorting lossless data compression algorithm, Tech. Rep. 124, Digital Equipment Corporation, 1994.
[11] Chan, H.-L.; Hon, W.-K.; Lam, T.-W.; Sadakane, K., Compressed indexes for dynamic text collections, ACM Transactions on Algorithms, 3, 2 (2007), Article 21 · Zbl 1321.68261
[12] Chazelle, B., A functional approach to data structures and its use in multidimensional searching, SIAM Journal on Computing, 17, 3, 427-462 (1988) · Zbl 0679.68074
[13] Clark, D.; Munro, J. I., Efficient suffix trees on secondary storage, (Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (1996)), 383-391 · Zbl 0847.68030
[14] Cormen, T.; Leiserson, C.; Rivest, R.; Stein, C., Introduction to Algorithms (2001), Prentice Hall
[15] Dementiev, R.; Kärkkäinen, J.; Mehnert, J.; Sanders, P., Better external memory suffix array construction, ACM Journal of Experimental Algorithmics (JEA), 12, 1-24 (2008), Article 3.4 · Zbl 1365.68178
[16] Ferragina, P.; Gagie, T.; Manzini, G., Lightweight data indexing and compression in external memory, (Proc. 8th Latin American Symposium on Theoretical Informatics (LATIN) (2010)), 697-710 · Zbl 1283.68140
[17] Ferragina, P.; González, R.; Navarro, G.; Venturini, R., Compressed text indexes: From theory to practice, ACM Journal of Experimental Algorithmics (JEA), 13 (2009), Article 12 · Zbl 1284.68255
[18] Ferragina, P.; Manzini, G., Indexing compressed text, Journal of the ACM, 54, 4, 552-581 (2005) · Zbl 1323.68261
[19] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representations of sequences and full-text indexes, ACM Transactions on Algorithms, 3, 2 (2007), Article 20 · Zbl 1321.68263
[20] Fich, F.; Munro, J. I.; Poblete, P., Permuting in place, SIAM Journal on Computing, 24, 2, 266-278 (1995) · Zbl 0939.68600
[21] Franceschini, G.; Muthukrishnan, S., In-place suffix sorting, (Proc. of 34th International Colloquium on Automata, Languages and Programming (ICALP). Proc. of 34th International Colloquium on Automata, Languages and Programming (ICALP), LNCS, vol. 4596 (2007)), 533-546 · Zbl 1171.68445
[22] Geary, R.; Rahman, N.; Raman, R.; Raman, V., A simple optimal representation for balanced parentheses, Theoretical Computer Science, 368, 3, 231-246 (2006) · Zbl 1103.68040
[23] González, R.; Grabowski, S.; Mäkinen, V.; Navarro, G., Practical implementation of rank and select queries, (Poster Proc. Vol. of 4th Workshop on Experimental and Efficient Algorithms (WEA) (2005), CTI Press and Ellinika Grammata), 27-38
[24] González, R.; Navarro, G., Rank/select on dynamic compressed sequences and applications, Theoretical Computer Science, 410, 4414-4422 (2008) · Zbl 1194.68103
[25] Grossi, R.; Gupta, A.; Vitter, J. S., High-order entropy-compressed text indexes, (Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2003)), 841-850 · Zbl 1092.68584
[26] Grossi, R.; Vitter, J. S., Compressed suffix arrays and suffix trees with applications to text indexing and string matching, SIAM Journal on Computing, 35, 2, 378-407 (2005) · Zbl 1092.68115
[27] He, M.; Munro, I., Succinct representations of dynamic strings, (Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE). Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS, vol. 6393 (2010)), 334-346
[28] W.-K. Hon, On the construction and application of compressed text indexes, PhD thesis, University of Hong Kong, 2004.; W.-K. Hon, On the construction and application of compressed text indexes, PhD thesis, University of Hong Kong, 2004.
[29] Hon, W. K.; Lam, T. W.; Sadakane, K.; Sung, W. K., Constructing compressed suffix arrays with large alphabets, (Proc. 14th Annual International Symposium on Algorithms and Computation (ISAAC). Proc. 14th Annual International Symposium on Algorithms and Computation (ISAAC), LNCS, vol. 2906 (2003)), 240-249 · Zbl 1205.68128
[30] Hon, W. K.; Lam, T. W.; Sadakane, K.; Sung, W.-K.; Yiu, M., A space and time efficient algorithm for constructing compressed suffix arrays, Algorithmica, 48, 1, 23-36 (2007) · Zbl 1123.68137
[31] Hon, W.-K.; Sadakane, K.; Sung, W.-K., Breaking a time-and-space barrier in constructing full-text indices, SIAM Journal on Computing, 38, 6, 2162-2178 (2009) · Zbl 1191.68225
[32] Sirén, J., Compressed suffix arrays for massive data, (Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS, vol. 5721 (2009)), 63-74
[33] Jansson, J.; Sadakane, K.; Sung, W.-K., Compressed dynamic tries with applications to LZ-compression in sublinear time and space, (27th Int. Conf. on Foundations of Software Technology and Theoretical Computer Science (FSTTCS) (2007)), 424-435 · Zbl 1135.68374
[34] Jansson, J.; Sadakane, K.; Sung, W.-K., Ultra-succinct representation of ordered trees, (Proc. 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2007)), 575-584 · Zbl 1302.68100
[35] Kärkkäinen, J., Fast BWT in small space by blockwise suffix sorting, Theoretical Computer Science, 387, 3, 249-257 (2007) · Zbl 1144.68021
[36] Kärkkäinen, J.; Ukkonen, E., Lempel-Ziv parsing and sublinear-size index structures for string matching, (Proc. 3rd South American Workshop on String Processing (WSP) (1996)), 141-155
[37] Kim, D.; Na, J.; Kim, J.; Park, K., Efficient implementation of rank and select functions for succinct representation, (Proc. 4th Workshop on Experimental and Efficient Algorithms (WEA). Proc. 4th Workshop on Experimental and Efficient Algorithms (WEA), LNCS, vol. 3503 (2005)), 315-327 · Zbl 1121.68461
[38] Kosaraju, R.; Manzini, G., Compression of low entropy strings with Lempel-Ziv algorithms, SIAM Journal on Computing, 29, 3, 893-911 (1999) · Zbl 0941.68055
[39] Kurtz, S., Reducing the space requirements of suffix trees, Software Practice and Experience, 29, 13, 1149-1171 (1999)
[40] Larsson, J.; Sadakane, K., Faster suffix sorting, Theoretical Computer Science, 387, 3, 258-272 (2007) · Zbl 1144.68022
[41] Mäkinen, V., Compact suffix array — a space-efficient full-text index, Fundamenta Informaticae, 56, 1-2, 191-210 (2003) · Zbl 1031.68053
[42] Mäkinen, V.; Navarro, G., Succinct suffix arrays based on run-length encoding, Nordic Journal of Computing, 12, 1, 40-66 (2005) · Zbl 1085.68031
[43] Mäkinen, V.; Navarro, G., Rank and select revisited and extended, Theoretical Computer Science, 387, 3, 332-347 (2007) · Zbl 1144.68023
[44] Mäkinen, V.; Navarro, G., Dynamic entropy-compressed sequences and full-text indexes, ACM Transactions on Algorithms, 4, 3 (2008), Article 32 · Zbl 1446.68043
[45] Mäkinen, V.; Navarro, G.; Sirén, J.; Välimäki, N., Storage and retrieval of highly repetitive sequence collections, Journal of Computational Biology, 17, 3, 281-308 (2010)
[46] Manber, U.; Myers, G., Suffix arrays: A new method for on-line string searches, SIAM Journal on Computing, 22, 5, 935-948 (1993) · Zbl 0784.68027
[47] Manzini, G., An analysis of the Burrows-Wheeler transform, Journal of the ACM, 48, 3, 407-430 (2001) · Zbl 1323.68262
[48] Manzini, G.; Ferragina, P., Engineering a lightweight suffix array construction algorithm, Algorithmica, 40, 1, 33-50 (2004) · Zbl 1082.68867
[49] Morrison, D. R., Patricia — practical algorithm to retrieve information coded in alphanumeric, Journal of the ACM, 15, 4, 514-534 (1968)
[50] Munro, I.; Raman, V.; Rao, S., Space efficient suffix trees, Journal of Algorithms, 39, 2, 205-222 (2001) · Zbl 0977.68069
[51] Munro, J. I., Tables, (Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), LNCS, vol. 1180 (1996)), 37-42
[52] Munro, J. I.; Raman, R.; Raman, V.; Rao, S. S., Succinct representations of permutations, (Proc. 30th International Colloquium on Automata, Languages and Computation (ICALP). Proc. 30th International Colloquium on Automata, Languages and Computation (ICALP), LNCS, vol. 2719 (2003)), 345-356 · Zbl 1039.68546
[53] Munro, J. I.; Raman, V., Succinct representation of balanced parentheses and static trees, SIAM Journal on Computing, 31, 3, 762-776 (2001) · Zbl 1017.68037
[54] Na, J.; Park, K., Alphabet-independent linear-time construction of compressed suffix arrays using \(o(n \log n)\)-bit working space, Theoretical Computer Science, 385, 127-136 (2007) · Zbl 1125.68041
[55] Navarro, G., Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms (JDA), 2, 1, 87-114 (2004) · Zbl 1118.68443
[56] Navarro, G., Implementing the LZ-index: Theory versus practice, ACM Journal of Experimental Algorithmics (JEA), 13 (2009), Article 2 · Zbl 1284.68258
[57] Navarro, G.; Mäkinen, V., Compressed full-text indexes, ACM Computing Surveys, 39, 1 (2007), Article 2 · Zbl 1321.68263
[58] Navarro, G.; Moura, E.; Neubert, M.; Ziviani, N.; Baeza-Yates, R., Adding compression to block addressing inverted indexes, Information Retrieval, 3, 1, 49-77 (2000)
[59] Navarro, G.; Sadakane, K., Fully-functional static and dynamic succinct trees (2010), Tech. Rep. · Zbl 1288.05046
[60] Okanohara, D.; Sadakane, K., Practical entropy-compressed rank/select dictionary, (Proc. Workshop on Algorithm Engineering and Experiments (ALENEX) (2007)), 60-70 · Zbl 1428.68134
[61] Pătraşcu, M., Succincter, (Proc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (2008)), 305-313
[62] Raman, R.; Raman, V.; Rao, S. S., Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets, (Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2002)), 233-242 · Zbl 1093.68582
[63] Raman, R.; Rao, S. S., Succinct dynamic dictionaries and trees, (Proc. 30th International Colloquium on Automata, Languages and Computation (ICALP). Proc. 30th International Colloquium on Automata, Languages and Computation (ICALP), LNCS, vol. 2719 (2003)), 357-368 · Zbl 1039.68043
[64] Russo, L.; Oliveira, A., A compressed self-index using a Ziv-Lempel dictionary, Information Retrieval, 5, 3, 501-513 (2007)
[65] Sadakane, K., New text indexing functionalities of the compressed suffix arrays, Journal of Algorithms, 48, 2, 294-313 (2003) · Zbl 1100.68563
[66] Sadakane, K.; Grossi, R., Squeezing succinct data structures into entropy bounds, (Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2006)), 1230-1239 · Zbl 1192.68188
[67] Sadakane, K.; Navarro, G., Fully-functional succinct trees, (Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2010)), 134-149 · Zbl 1288.05046
[68] Sadakane, K.; Shibuya, T., Indexing huge genome sequences for solving various problems, Genome Informatics, 12, 175-183 (2001)
[69] Vitter, J. S., Algorithms and Data Structures for External Memory, Foundations and Trends in Theoretical Computer Science (2008), Now Publishers
[70] Weiner, P., Linear pattern matching algorithms, (Proc. 14th Annual Symposium on Foundations of Computer Science (FOCS) (1973)), 1-11
[71] Witten, I.; Moffat, A.; Bell, T., Managing Gigabytes (1999), Morgan Kaufmann Publishers
[72] Ziv, J.; Lempel, A., A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23, 3, 337-343 (1977) · Zbl 0379.94010
[73] Ziv, J.; Lempel, A., Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory, 24, 5, 530-536 (1978) · Zbl 0392.94004
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.