×

Lempel-Ziv-like parsing in small space. (English) Zbl 1460.68038

Summary: Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the \(k\) th order empirical entropy compression \(n H_k + o(n\log \sigma )\) with \(k = o(\log_\sigma n)\), where \(n\) is the input length and \(\sigma\) is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be \(\Omega (\log n)\) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ.

MSC:

68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68W32 Algorithms on strings
PDF BibTeX XML Cite
Full Text: DOI arXiv

References:

[1] Alakuijala, J.; Farruggia, A.; Ferragina, P.; Kliuchnikov, E.; Obryk, R.; Szabadka, Z.; Vandevenne, L., Brotli: a general-purpose data compressor, ACM Trans. Inf. Syst., 37, 1, 4 (2018)
[2] Amir, A.; Landau, GM; Ukkonen, E., Online timestamped text indexing, Inf. Process. Lett., 82, 5, 253-259 (2002) · Zbl 1338.68276
[3] Bannai, H., Gagie, T., I, T.: Online LZ77 parsing and matching statistics with RLBWTs. In: Proceedings of the CPM 2018, LIPIcs, vol. 105, pp. 7:1-7:12. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018). 10.4230/LIPIcs.CPM.2018.7
[4] Belazzougui, D., Puglisi, S.J.: Range predecessor and Lempel-Ziv parsing. In: Proceedings of the SODA 2016, pp. 2053-2071. SIAM (2016). 10.1137/1.9781611974331.ch143 · Zbl 1410.68116
[5] Bille, P., Cording, P.H., Fischer, J., Gørtz, I.L.: Lempel-Ziv compression in a sliding window. In: Proceedings of the CPM 2017, LIPIcs, vol. 78. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2017). 10.4230/LIPIcs.CPM.2017.15 · Zbl 1434.68723
[6] Deorowicz, S.; Danek, A.; Niemiec, M., GDC 2: compression of large collections of genomes, Sci. Rep., 5, 11565 (2015)
[7] Deorowicz, S.; Grabowski, S., Robust relative compression of genomes with random access, Bioinformatics, 27, 21, 2979-2986 (2011)
[8] Duda, J.: Asymmetric numeral systems as close to capacity low state entropy coders. CoRR abs/1311.2540 (2013). http://arxiv.org/abs/1311.2540
[9] Elias, P., Universal codeword sets and representations of the integers, IEEE Trans. Inf. Theory, 21, 2, 194-203 (1975) · Zbl 0298.94011
[10] Ferragina, P.; Nitto, I.; Venturini, R., On the bit-complexity of Lempel-Ziv compression, SIAM J. Comput., 42, 4, 1521-1541 (2013) · Zbl 1276.68069
[11] Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: Proceedings of the ESA 2015, LNCS, vol. 9294, pp. 533-544. Springer (2015). 10.1007/978-3-662-48350-3_45 · Zbl 1466.68090
[12] Gagie, T., Large alphabets and incompressibility, Inf. Process. Lett., 99, 6, 246-251 (2006) · Zbl 1185.68367
[13] Gagie, T., Manzini, G.: Space-conscious compression. In: Proc. MFCS 2007, LNCS, vol. 4708, pp. 206-217. Springer (2007). 10.1007/978-3-540-74456-6_20 · Zbl 1147.68513
[14] Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel-Ziv parsing. In: Proceedings of the LATIN 2018, LNCS, vol. 10807, pp. 490-503. Springer (2018). 10.1007/978-3-319-77404-6_36 · Zbl 1485.68129
[15] Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the SODA 2018, pp. 1459-1477. SIAM (2018). 10.1137/1.9781611975031.96 · Zbl 1403.68051
[16] Gagie, T., Puglisi, S.J., Valenzuela, D.: Analyzing relative Lempel-Ziv reference construction. In: Proceedings of the SPIRE 2016, LNCS, vol. 9954, pp. 160-165. Springer (2016). 10.1007/978-3-319-46049-9_16 · Zbl 1397.68071
[17] Gańczorz, M.: Entropy bounds for grammar compression. CoRR abs/1804.08547 (2018). http://arxiv.org/abs/1804.08547
[18] Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: Plug and play with succinct data structures. In: Proceedings of the SEA 2014, LNCS, vol. 8504, pp. 326-337. Springer (2014). 10.1007/978-3-319-07959-2_28
[19] Hoobin, C.; Puglisi, SJ; Zobel, J., Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections, Proc. VLDB Endow., 5, 3, 265-273 (2011)
[20] Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel-Ziv factorization: Simple, fast, small. In: Proceedings of the CPM 2013, LNCS, vol. 7922, pp. 189-200. Springer (2013). 10.1007/978-3-642-38905-4_19 · Zbl 1381.68317
[21] Karkkainen, J., Kempa, D., Puglisi, S.J.: Lempel-Ziv parsing in external memory. In: Proceedings of the DCC 2014, pp. 153-162. IEEE (2014). 10.1109/DCC.2014.78
[22] Kempa, D., Kosolobov, D.: LZ-End parsing in compressed space. In: Proceedings of the DCC 2017, pp. 350-359. IEEE (2017). 10.1109/DCC.2017.73 · Zbl 1442.68050
[23] Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the STOC 2018, pp. 827-840. ACM (2018). 10.1145/3188745.3188814 · Zbl 1418.68085
[24] Kosaraju, SR; Manzini, G., Compression of low entropy strings with Lempel-Ziv algorithms, SIAM J. Comput., 29, 3, 893-911 (1999) · Zbl 0941.68055
[25] Kosolobov, D.: Relations between greedy and bit-optimal LZ77 encodings. In: Proceedings of the STACS 2018, LIPIcs, vol. 96, pp. 46:1-46:14. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018). 10.4230/LIPIcs.STACS.2018.46 · Zbl 1487.68101
[26] Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the DCC 2010, pp. 239-248. IEEE (2010). 10.1109/DCC.2010.29
[27] Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Proceedings of the SPIRE 2010, LNCS, vol. 6393, pp. 201-206. Springer (2010). 10.1007/978-3-642-16321-0_20 · Zbl 1397.68073
[28] Kuruppu, S., Puglisi, S.J., Zobel, J.: Optimized relative Lempel-Ziv compression of genomes. In: Australasian Computer Science Conference, pp. 91-98. Australian Computer Society, Inc. (2011) · Zbl 1397.68073
[29] Larsson, N.J.: Most recent match queries in on-line suffix trees. In: Proceedings of the CPM 2014, LNCS, vol. 8486, pp. 252-261 (2014). 10.1007/978-3-319-07566-2_26 · Zbl 1407.68112
[30] Larsson, NJ; Sadakane, K., Faster suffix sorting, Theor. Comput. Sci., 387, 3, 258-272 (2007) · Zbl 1144.68022
[31] Lemire, D.; Boytsov, L., Decoding billions of integers per second through vectorization, Softw. Pract. Exp., 45, 1, 1-29 (2015)
[32] Levenshtein, VI, On the redundancy and delay of decodable coding of natural numbers, Syst. Theory Res., 20, 149-155 (1968)
[33] Liao, K., Petri, M., Moffat, A., Wirth, A.: Effective construction of relative Lempel-Ziv dictionaries. In: Proceedings of the WWW 2016, pp. 807-816. International World Wide Web Conferences Steering Committee (2016). 10.1145/2872427.2883042
[34] Mäkinen, V.; Navarro, G., Compressed full-text indexes, ACM Comput. Surv., 39, 1, 2 (2007)
[35] Manzini, G., An analysis of the Burrows-Wheeler transform, J. ACM, 48, 3, 407-430 (2001) · Zbl 1323.68262
[36] Navarro, G.: Indexing highly repetitive collections. In: Proceedings of the IWOCA 2012, LNCS, vol. 7643, pp. 274-279 (2012). 10.1007/978-3-642-35926-2_29 · Zbl 1293.68087
[37] Ochoa, C.; Navarro, G., RePair and all irreducible grammars are upper bounded by high-order empirical entropy, IEEE Trans. Inf. Theory (2018) · Zbl 1432.68213
[38] Policriti, A., Prezza, N.: Fast online Lempel-Ziv factorization in compressed space. In: Proceedings of the SPIRE 2015, LNCS, vol. 9309, pp. 13-20. Springer (2015). 10.1007/978-3-319-23826-5_2
[39] Policriti, A.; Prezza, N., LZ77 computation based on the run-length encoded BWT, Algorithmica, 80, 7, 1986-2011 (2018) · Zbl 1392.68186
[40] Puglisi, SJ; Kao, M-Y, Lempel-Ziv compression, Encyclopedia of algorithms, 1095-1100 (2016), New York: Springer, New York
[41] Shields, PC, Performance of LZ algorithms on individual sequences, IEEE Trans. Inf. Theory, 45, 4, 1283-1288 (1999) · Zbl 0957.94011
[42] Storer, JA; Szymanski, TG, Data compression via textual substitution, J. ACM, 29, 4, 928-951 (1982) · Zbl 0489.68041
[43] Tillson, TW, A hamiltonian decomposition of \(K^*_{2m}, 2m \ge 8\), J. Combin. Theory Ser. B, 29, 1, 68-74 (1980) · Zbl 0439.05025
[44] Valenzuela, D.: CHICO: A compressed hybrid index for repetitive collections. In: Proceedings of the SEA 2016, LNCS, vol. 9685, pp. 326-338. Springer (2016). 10.1007/978-3-319-38851-9_22
[45] Wandelt, S.; Leser, U., FRESCO: referential compression of highly similar sequences, IEEE/ACM Trans. Comput. Biol. Bioinf., 10, 5, 1275-1288 (2013)
[46] Wyner, AJ, The redundancy and distribution of the phrase lengths of the fixed-database Lempel-Ziv algorithm, IEEE Trans. Inf. Theory, 43, 5, 1452-1464 (1997) · Zbl 1053.94543
[47] Yann Collet: Zstandard. (2016). Retrieved from: https://facebook.github.io/zstd/. Accessed 2018-09-17
[48] Yuta Mori: libdivsufsort. https://github.com/y-256/libdivsufsort/. Accessed 22 May 2020
[49] Ziv, J.; Lempel, A., On the complexity of finite sequences, IEEE Trans. Inf. Theory, 22, 1, 75-81 (1976) · Zbl 0337.94013
[50] Ziv, J.; Lempel, A., A universal algorithm for sequential data compression, IEEE Trans. Inf. Theory, 23, 3, 337-343 (1977) · Zbl 0379.94010
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.