zbMATH — the first resource for mathematics

Stronger Lempel-Ziv based compressed text indexing. (English) Zbl 1241.68061
Summary: Given a text \(T[1..u]\) over an alphabet of size \(\sigma\), the full-text search problem consists of finding the \(occ\) occurrences of a given pattern \(P[1..m]\) in \(T\). In indexed text searching we build an index on \(T\) to reduce the search time, at the cost of a larger space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it while still providing indexed access to the text. Thus, we can provide efficient access within compressed space.
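To make the problem statement concrete, the following is a minimal sketch of the unindexed baseline: a sequential scan that reports all \(occ\) occurrences of \(P\) in \(T\). It is not the method of the paper; the function name `find_occurrences` is a hypothetical one chosen for illustration, and positions are reported 1-based to match the \(T[1..u]\) notation.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of all occurrences of pattern in text.

    This is a plain sequential scan (no index), shown only to illustrate
    the full-text search problem. Overlapping occurrences are reported.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)  # convert 0-based offset to 1-based position
        # Resume one symbol to the right so overlapping matches are found.
        start = text.find(pattern, start + 1)
    return positions
```

For instance, `find_occurrences("abracadabra", "abra")` reports positions 1 and 8, so \(occ = 2\). An index on \(T\) aims to avoid rescanning the whole text for every query.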
The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent \(T\) using \(4uH_k(T)+o(u \log \sigma)\) bits of space, where \(H_k(T)\) denotes the \(k\)-th order empirical entropy of \(T\), for any \(k=o(\log_\sigma u)\). This space is about four times the compressed text size. The index can locate all the \(occ\) occurrences of a pattern \(P\) in \(T\) in \(O(m^3 \log \sigma +(m+occ) \log u)\) worst-case time. Although this index has proven very competitive in practice, the \(O(m^3 \log \sigma)\) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives.
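The space bounds above are stated in terms of \(H_k(T)\). As a rough illustration, the sketch below computes it from the usual definition \(H_k(T) = \frac{1}{u}\sum_{w} |T_w|\, H_0(T_w)\), where \(w\) ranges over the length-\(k\) contexts occurring in \(T\) and \(T_w\) collects the symbols that follow \(w\). The function names are hypothetical, and the first \(k\) symbols (which have no full context) are simply skipped, which is an assumption of this sketch rather than something stated in the summary.

```python
import math
from collections import Counter, defaultdict


def zero_order_entropy(counts: Counter) -> float:
    """H_0 in bits per symbol of a multiset of symbols given by its counts."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def empirical_entropy(text: str, k: int) -> float:
    """k-th order empirical entropy H_k(text), in bits per symbol."""
    u = len(text)
    if u == 0:
        return 0.0
    # followers[w] counts the symbols that appear right after context w.
    followers = defaultdict(Counter)
    for i in range(k, u):
        followers[text[i - k:i]][text[i]] += 1
    # Weighted sum of the zero-order entropies of each context's followers.
    weighted = sum(sum(cnt.values()) * zero_order_entropy(cnt)
                   for cnt in followers.values())
    return weighted / u
```

On `"abababab"` this gives \(H_0 = 1\) bit per symbol but \(H_1 = 0\), since each symbol is fully determined by its predecessor. A bound such as \(4uH_k(T)\) bits is therefore small for texts that are predictable from short contexts.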
In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring \((2+\epsilon)uH_k(T)+o(u \log \sigma)\) bits of space, for any constant \(\epsilon >0\), which makes them the smallest existing LZ-indices. We simultaneously improve the search time to \(O(m^2+(m+occ) \log u)\), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length \(\ell\) in optimal \(O(\ell /\log_\sigma u)\) time. In addition, we show how the space can be squeezed to \((1+\epsilon)uH_k(T)+o(u \log \sigma)\) to obtain a structure with \(O(m^2)\) average search time for \(m\geqslant 2\log_\sigma u\). Alternatively, the search time of LZ-indices can be improved to \(O((m+occ)\log u)\) with \((3+\epsilon)uH_k(T)+o(u \log \sigma)\) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.

MSC:
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68R15 Combinatorics on words
68W32 Algorithms on strings
68P05 Data structures
Software:
PATRICIA