×

zbMATH — the first resource for mathematics

Space-efficient construction of Lempel-Ziv compressed text indexes. (English) Zbl 1220.68051
Summary: A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text \(T[1..u]\) over an alphabet of size \(\sigma\), the LZ-index requires \(4|LZ|(1+o(1))\) bits of space, where \(|LZ|\) is the size of the LZ78-compression of \(T\). This can be bounded by \(|LZ|=uH_k(T)+o(u\log \sigma)\), where \(H_{k}(T)\) is the \(k\)-th order empirical entropy of \(T\), for any \(k=o(\log_\sigma u)\).
The LZ-index is built in \(O(u\log \sigma)\) time, yet requiring \(O(u\log u)\) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0–1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in \(O(u(\log \sigma+\log\log u))\) time and requiring \(4|LZ|(1+o(1))\) bits of main memory, that is, asymptotically the same space of the final index.
We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times \(|LZ|(1+o(1))\) bits, and show that these can also be built using asymptotically the same space of the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in \(O(u(\log \sigma+\log\log u))\) time, and within \(|LZ|(1+o(1))\) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.

MSC:
68P05 Data structures
68P15 Database theory
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
Software:
PATRICIA
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Apostolico, A., The myriad virtues of subword trees, (), 85-96 · Zbl 0572.68067
[2] Arroyuelo, D., An improved succinct representation for dynamic k-ary trees, (), 277-289 · Zbl 1143.68381
[3] Arroyuelo, D.; Cánovas, R.; Navarro, G.; Sadakane, K., Succinct trees in practice, (), 84-97
[4] Arroyuelo, D.; Navarro, G., Space-efficient construction of LZ-index, (), 1143-1152 · Zbl 1175.68117
[5] Arroyuelo, D.; Navarro, G., Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices, ACM journal of experimental algorithmics, 15, 1.5, (2010) · Zbl 1284.68253
[6] Arroyuelo, D.; Navarro, G.; Sadakane, K., Reducing the space requirement of LZ-index, (), 319-330 · Zbl 1196.68076
[7] D. Arroyuelo, G. Navarro, K. Sadakane, Stronger Lempel-Ziv based compressed text indexing, Algorithmica (2011), doi:10.1007/s00453-010-9443-8, in press. · Zbl 1241.68061
[8] Benoit, D.; Demaine, E.; Munro, J.I.; Raman, R.; Raman, V.; Rao, S.S., Representing trees of higher degree, Algorithmica, 43, 4, 275-292, (2005) · Zbl 1086.68034
[9] Brodnik, A.; Carlsson, S.; Demaine, E.; Munro, J.I.; Sedgewick, R., Resizable arrays in optimal time and space, (), 37-48 · Zbl 1063.68572
[10] M. Burrows, D.J. Wheeler, A block-sorting lossless data compression algorithm, Tech. Rep. 124, Digital Equipment Corporation, 1994.
[11] Chan, H.-L.; Hon, W.-K.; Lam, T.-W.; Sadakane, K., Compressed indexes for dynamic text collections, ACM transactions on algorithms, 3, 2, (2007), Article 21 · Zbl 1321.68261
[12] Chazelle, B., A functional approach to data structures and its use in multidimensional searching, SIAM journal on computing, 17, 3, 427-462, (1988) · Zbl 0679.68074
[13] Clark, D.; Munro, J.I., Efficient suffix trees on secondary storage, (), 383-391 · Zbl 0847.68030
[14] Cormen, T.; Leiserson, C.; Rivest, R.; Stein, C., Introduction to algorithms, (2001), Prentice Hall
[15] Dementiev, R.; Kärkkäinen, J.; Mehnert, J.; Sanders, P., Better external memory suffix array construction, ACM journal of experimental algorithmics (JEA), 12, 1-24, (2008), Article 3.4 · Zbl 1365.68178
[16] Ferragina, P.; Gagie, T.; Manzini, G., Lightweight data indexing and compression in external memory, (), 697-710 · Zbl 1283.68140
[17] Ferragina, P.; González, R.; Navarro, G.; Venturini, R., Compressed text indexes: from theory to practice, ACM journal of experimental algorithmics (JEA), 13, (2009), Article 12 · Zbl 1284.68255
[18] Ferragina, P.; Manzini, G., Indexing compressed text, Journal of the ACM, 54, 4, 552-581, (2005) · Zbl 1323.68261
[19] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representations of sequences and full-text indexes, ACM transactions on algorithms, 3, 2, (2007), Article 20 · Zbl 1321.68263
[20] Fich, F.; Munro, J.I.; Poblete, P., Permuting in place, SIAM journal on computing, 24, 2, 266-278, (1995) · Zbl 0939.68600
[21] Franceschini, G.; Muthukrishnan, S., In-place suffix sorting, (), 533-546 · Zbl 1171.68445
[22] Geary, R.; Rahman, N.; Raman, R.; Raman, V., A simple optimal representation for balanced parentheses, Theoretical computer science, 368, 3, 231-246, (2006) · Zbl 1103.68040
[23] González, R.; Grabowski, S.; Mäkinen, V.; Navarro, G., Practical implementation of rank and select queries, (), 27-38
[24] González, R.; Navarro, G., Rank/select on dynamic compressed sequences and applications, Theoretical computer science, 410, 4414-4422, (2008) · Zbl 1194.68103
[25] Grossi, R.; Gupta, A.; Vitter, J.S., High-order entropy-compressed text indexes, (), 841-850 · Zbl 1092.68584
[26] Grossi, R.; Vitter, J.S., Compressed suffix arrays and suffix trees with applications to text indexing and string matching, SIAM journal on computing, 35, 2, 378-407, (2005) · Zbl 1092.68115
[27] He, M.; Munro, I., Succinct representations of dynamic strings, (), 334-346
[28] W.-K. Hon, On the construction and application of compressed text indexes, PhD thesis, University of Hong Kong, 2004.
[29] Hon, W.K.; Lam, T.W.; Sadakane, K.; Sung, W.K., Constructing compressed suffix arrays with large alphabets, (), 240-249 · Zbl 1205.68128
[30] Hon, W.K.; Lam, T.W.; Sadakane, K.; Sung, W.-K.; Yiu, M., A space and time efficient algorithm for constructing compressed suffix arrays, Algorithmica, 48, 1, 23-36, (2007) · Zbl 1123.68137
[31] Hon, W.-K.; Sadakane, K.; Sung, W.-K., Breaking a time-and-space barrier in constructing full-text indices, SIAM journal on computing, 38, 6, 2162-2178, (2009) · Zbl 1191.68225
[32] Sirén, J., Compressed suffix arrays for massive data, (), 63-74
[33] Jansson, J.; Sadakane, K.; Sung, W.-K., Compressed dynamic tries with applications to LZ-compression in sublinear time and space, (), 424-435 · Zbl 1135.68374
[34] Jansson, J.; Sadakane, K.; Sung, W.-K., Ultra-succinct representation of ordered trees, (), 575-584 · Zbl 1302.68100
[35] Kärkkäinen, J., Fast BWT in small space by blockwise suffix sorting, Theoretical computer science, 387, 3, 249-257, (2007) · Zbl 1144.68021
[36] Kärkkäinen, J.; Ukkonen, E., Lempel-Ziv parsing and sublinear-size index structures for string matching, (), 141-155
[37] Kim, D.; Na, J.; Kim, J.; Park, K., Efficient implementation of rank and select functions for succinct representation, (), 315-327 · Zbl 1121.68461
[38] Kosaraju, R.; Manzini, G., Compression of low entropy strings with Lempel-Ziv algorithms, SIAM journal on computing, 29, 3, 893-911, (1999) · Zbl 0941.68055
[39] Kurtz, S., Reducing the space requirements of suffix trees, Software practice and experience, 29, 13, 1149-1171, (1999)
[40] Larsson, J.; Sadakane, K., Faster suffix sorting, Theoretical computer science, 387, 3, 258-272, (2007) · Zbl 1144.68022
[41] Mäkinen, V., Compact suffix array — a space-efficient full-text index, Fundamenta informaticae, 56, 1-2, 191-210, (2003) · Zbl 1031.68053
[42] Mäkinen, V.; Navarro, G., Succinct suffix arrays based on run-length encoding, Nordic journal of computing, 12, 1, 40-66, (2005) · Zbl 1085.68031
[43] Mäkinen, V.; Navarro, G., Rank and select revisited and extended, Theoretical computer science, 387, 3, 332-347, (2007) · Zbl 1144.68023
[44] Mäkinen, V.; Navarro, G., Dynamic entropy-compressed sequences and full-text indexes, ACM transactions on algorithms, 4, 3, (2008), Article 32
[45] Mäkinen, V.; Navarro, G.; Sirén, J.; Välimäki, N., Storage and retrieval of highly repetitive sequence collections, Journal of computational biology, 17, 3, 281-308, (2010)
[46] Manber, U.; Myers, G., Suffix arrays: A new method for on-line string searches, SIAM journal on computing, 22, 5, 935-948, (1993) · Zbl 0784.68027
[47] Manzini, G., An analysis of the Burrows-Wheeler transform, Journal of the ACM, 48, 3, 407-430, (2001) · Zbl 1323.68262
[48] Manzini, G.; Ferragina, P., Engineering a lightweight suffix array construction algorithm, Algorithmica, 40, 1, 33-50, (2004) · Zbl 1082.68867
[49] Morrison, D.R., Patricia — practical algorithm to retrieve information coded in alphanumeric, Journal of the ACM, 15, 4, 514-534, (1968)
[50] Munro, I.; Raman, V.; Rao, S., Space efficient suffix trees, Journal of algorithms, 39, 2, 205-222, (2001) · Zbl 0977.68069
[51] Munro, J.I., Tables, (), 37-42
[52] Munro, J.I.; Raman, R.; Raman, V.; Rao, S.S., Succinct representations of permutations, (), 345-356 · Zbl 1039.68546
[53] Munro, J.I.; Raman, V., Succinct representation of balanced parentheses and static trees, SIAM journal on computing, 31, 3, 762-776, (2001) · Zbl 1017.68037
[54] Na, J.; Park, K., Alphabet-independent linear-time construction of compressed suffix arrays using \(o(n \log n)\)-bit working space, Theoretical computer science, 385, 127-136, (2007) · Zbl 1125.68041
[55] Navarro, G., Indexing text using the Ziv-Lempel trie, Journal of discrete algorithms (JDA), 2, 1, 87-114, (2004) · Zbl 1118.68443
[56] Navarro, G., Implementing the LZ-index: theory versus practice, ACM journal of experimental algorithmics (JEA), 13, (2009), Article 2 · Zbl 1284.68258
[57] Navarro, G.; Mäkinen, V., Compressed full-text indexes, ACM computing surveys, 39, 1, (2007), Article 2 · Zbl 1321.68263
[58] Navarro, G.; Moura, E.; Neubert, M.; Ziviani, N.; Baeza-Yates, R., Adding compression to block addressing inverted indexes, Information retrieval, 3, 1, 49-77, (2000)
[59] Navarro, G.; Sadakane, K., Fully-functional static and dynamic succinct trees, (2010), Tech. Rep. · Zbl 1288.05046
[60] Okanohara, D.; Sadakane, K., Practical entropy-compressed rank/select dictionary, (), 60-70
[61] Pătraşcu, M., Succincter, (), 305-313
[62] Raman, R.; Raman, V.; Rao, S.S., Succinct indexable dictionaries with applications to encoding k-ary trees and multisets, (), 233-242 · Zbl 1093.68582
[63] Raman, R.; Rao, S.S., Succinct dynamic dictionaries and trees, (), 357-368 · Zbl 1039.68043
[64] Russo, L.; Oliveira, A., A compressed self-index using a Ziv-Lempel dictionary, Information retrieval, 5, 3, 501-513, (2007)
[65] Sadakane, K., New text indexing functionalities of the compressed suffix arrays, Journal of algorithms, 48, 2, 294-313, (2003) · Zbl 1100.68563
[66] Sadakane, K.; Grossi, R., Squeezing succinct data structures into entropy bounds, (), 1230-1239 · Zbl 1192.68188
[67] Sadakane, K.; Navarro, G., Fully-functional succinct trees, (), 134-149 · Zbl 1288.05046
[68] Sadakane, K.; Shibuya, T., Indexing huge genome sequences for solving various problems, Genome informatics, 12, 175-183, (2001)
[69] Vitter, J.S., Algorithms and data structures for external memory, Foundations and trends in theoretical computer science, (2008), Now Publishers
[70] Weiner, P., Linear pattern matching algorithms, (), 1-11
[71] Witten, I.; Moffat, A.; Bell, T., Managing gigabytes, (1999), Morgan Kaufmann Publishers
[72] Ziv, J.; Lempel, A., A universal algorithm for sequential data compression, IEEE transactions on information theory, 23, 3, 337-343, (1977) · Zbl 0379.94010
[73] Ziv, J.; Lempel, A., Compression of individual sequences via variable-rate coding, IEEE transactions on information theory, 24, 5, 530-536, (1978) · Zbl 0392.94004
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.