×

zbMATH — the first resource for mathematics

Self-indexed text compression using straight-line programs. (English) Zbl 1233.68132
Královič, Rastislav (ed.) et al., Mathematical foundations of computer science 2009. 34th international symposium, MFCS 2009, Novy Smokovec, High Tatras, Slovakia, August 24–28, 2009. Proceedings. Berlin: Springer (ISBN 978-3-642-03815-0/pbk). Lecture Notes in Computer Science 5734, 235-246 (2009).
Summary: Straight-line programs (SLPs) offer powerful text compression by representing a text \(T[1,u]\) in terms of a restricted context-free grammar of \(n\) rules, so that \(T\) can be recovered in \(O(u)\) time. However, the problem of operating the grammar in compressed form has not been studied much. We present a grammar representation whose size is of the same order as that of a plain SLP representation, and we can answer other queries apart from expanding nonterminals. This can be of independent interest. We then extend the grammar representation to achieve the first grammar representation able to extract text substrings, and to search the text for patterns in time \(o(n)\). We also give byproducts on the representation of binary relations.
For the entire collection see [Zbl 1173.68012].

MSC:
68P15 Database theory
68Q42 Grammars and rewriting systems
Software:
PATRICIA
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Amir, A., Benson, G.: Efficient two-dimensional compressed matching. In: Proc. 2nd DCC, pp. 279–288 (1992) · doi:10.1109/DCC.1992.227453
[2] Barbay, J., Golynski, A., Munro, I., Rao, S.S.: Adaptive searching in succinctly encoded binary relations and tree-structured documents. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 24–35. Springer, Heidelberg (2006) · Zbl 1144.68307 · doi:10.1007/11780441_4
[3] Bender, M., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comp. Sci. 321(1), 5–12 (2004) · Zbl 1068.68047 · doi:10.1016/j.tcs.2003.05.002
[4] Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005) · Zbl 1086.68034 · doi:10.1007/s00453-004-1146-6
[5] Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE TIT 51(7), 2554–2576 (2005) · Zbl 1296.68086
[6] Clark, D.: Compact Pat Trees. PhD thesis, University of Waterloo (1996)
[7] Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000) · Zbl 1094.68694 · doi:10.1145/355541.355547
[8] Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005) · Zbl 1323.68261 · doi:10.1145/1082036.1082039
[9] Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Alg. 3(2), 20 (2007) · Zbl 1321.68263 · doi:10.1145/1240233.1240243
[10] Gasieniec, L., Kolpakov, R., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: Proc. 15th DCC, p. 458 (2005) · doi:10.1109/DCC.2005.78
[11] Gasieniec, L., Potapov, I.: Time/space efficient compressed pattern matching. Fund. Inf. 56(1-2), 137–154 (2003) · Zbl 1030.68071
[12] Golynski, A., Munro, I., Rao, S.: Rank/select operations on large alphabets: a tool for text indexing. In: Proc. 17th SODA, pp. 368–373 (2006) · Zbl 1192.68800 · doi:10.1145/1109557.1109599
[13] Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. 14th SODA, pp. 841–850 (2003) · Zbl 1092.68584
[14] Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 943–955. Springer, Heidelberg (2003) · Zbl 1039.68042 · doi:10.1007/3-540-45061-0_73
[15] Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proc. 3rd WSP, pp. 141–155. Carleton University Press (1996)
[16] Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic J. Comp. 4(2), 172–186 (1997) · Zbl 0874.68087
[17] Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theor. Comp. Sci. 298(1), 253–272 (2003) · Zbl 1038.68045 · doi:10.1016/S0304-3975(02)00426-7
[18] Kieffer, J., Yang, E.-H.: Grammar-based codes: A new class of universal lossless source codes. IEEE TIT 46(3), 737–754 (2000) · Zbl 1001.94019
[19] Larsson, J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000) · doi:10.1109/5.892708
[20] Mäkinen, V., Navarro, G.: Rank and select revisited and extended. Theor. Comp. Sci. 387(3), 332–347 (2007) · Zbl 1144.68023 · doi:10.1016/j.tcs.2007.07.013
[21] Morrison, D.: PATRICIA – practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514–534 (1968) · doi:10.1145/321479.321481
[22] Munro, J., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 345–356. Springer, Heidelberg (2003) · Zbl 1039.68546 · doi:10.1007/3-540-45061-0_29
[23] Navarro, G.: Indexing text using the Ziv-Lempel trie. J. Discr. Alg. 2(1), 87–114 (2004) · Zbl 1118.68443 · doi:10.1016/S1570-8667(03)00066-2
[24] Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comp. Surv. 39(1), 2 (2007) · Zbl 1321.68263 · doi:10.1145/1216370.1216372
[25] Nevill-Manning, C., Witten, I., Maulsby, D.: Compression by induction of hierarchical grammars. In: Proc. 4th DCC, pp. 244–253 (1994) · doi:10.1109/DCC.1994.305932
[26] Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proc. 13th SODA, pp. 233–242 (2002) · Zbl 1093.68582
[27] Russo, L., Oliveira, A.: A compressed self-index using a Ziv-Lempel dictionary. Inf. Retr. 11(4), 359–388 (2008) · Zbl 05343849 · doi:10.1007/s10791-008-9050-3
[28] Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comp. Sci. 302(1-3), 211–222 (2003) · Zbl 1051.68088 · doi:10.1016/S0304-3975(02)00777-6
[29] Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008) · Zbl 1345.68124 · doi:10.1007/978-3-540-89097-3_17
[30] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE TIT 23(3), 337–343 (1977) · Zbl 0379.94010
[31] Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE TIT 24(5), 530–536 (1978) · Zbl 0392.94004
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.