×

zbMATH — the first resource for mathematics

On compressing and indexing repetitive sequences. (English) Zbl 1292.68061
Summary: We introduce LZ-End, a new member of the Lempel-Ziv family of text compressors, which achieves compression ratios close to those of LZ77 but is much faster at extracting arbitrary text substrings. We then build the first self-index based on LZ77 (or LZ-End) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This self-index is particularly effective for representing highly repetitive sequence collections, which arise for example when storing versioned documents, software repositories, periodic publications, and biological sequence databases.

MSC:
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68W32 Algorithms on strings
Software:
PATRICIA
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] S. Kreft, G. Navarro, LZ77-like compression with fast random access, in: Proc. 20th Data Compression Conference, DCC’10, pp. 239-248.
[2] S. Kreft, G. Navarro, Self-indexing based on LZ77, in: Proc. 22nd Annual Symposium on Combinatorial Pattern Matching, CPM’11, in: LNCS, vol. 6661, pp. 41-54. · Zbl 1339.68334
[3] Ziviani, N.; de Moura, E. S.; Navarro, G.; Baeza-Yates, R., Compression: a key for next-generation text retrieval systems, IEEE Computer, 33, 37-44, (2000)
[4] Navarro, G.; Mäkinen, V., Compressed full-text indexes, ACM Computing Surveys, 39, (2007), (article 2) · Zbl 1321.68263
[5] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st Annual Symposium on Foundations of Computer Science, FOCS’00, pp. 390-398.
[6] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd Annual ACM Symposium on Theory of Computing, STOC’00, pp. 397-406. · Zbl 1296.68035
[7] Apostolico, A., The myriad virtues of subword trees, (Combinatorial Algorithms on Words, NATO ISI Series, (1985), Springer-Verlag), 85-96 · Zbl 0572.68067
[8] Manber, U.; Myers, G., Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, 22, 935-948, (1993) · Zbl 0784.68027
[9] Manzini, G., An analysis of the Burrows-Wheeler transform, Journal of the ACM, 48, 407-430, (2001) · Zbl 1323.68262
[10] R. Grossi, A. Gupta, J.S. Vitter, High-order entropy-compressed text indexes, in: Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’03, pp. 841-850. · Zbl 1092.68584
[11] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representations of sequences and full-text indexes, ACM Transactions on Algorithms, 3, (2007), article 20 · Zbl 1321.68263
[12] Gagie, T., Large alphabets and incompressibility, Information Processing Letters, 99, 246-251, (2006) · Zbl 1185.68367
[13] C. Nevill-Manning, I. Witten, D. Maulsby, Compression by induction of hierarchical grammars, in: Proc. 4th Data Compression Conference, DCC’94, pp. 244-253.
[14] Larsson, N. J.; Moffat, A., Off-line dictionary-based compression, Proceedings of the IEEE, 88, 1722-1732, (2000)
[15] Lempel, A.; Ziv, J., On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75-81, (1976) · Zbl 0337.94013
[16] Ziv, J.; Lempel, A., A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23, 337-343, (1977) · Zbl 0379.94010
[17] Ziv, J.; Lempel, A., Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory, 24, 530-536, (1978) · Zbl 0392.94004
[18] F. Claude, G. Navarro, Self-indexed text compression using straight-line programs, in: Proc. 34th International Symposium on Mathematical Foundations of Computer Science, MFCS’09, in: LNCS, vol. 5734, pp. 235-246. · Zbl 1233.68132
[19] P. Bille, G.M. Landau, R. Raman, K. Sadakane, S.R. Satti, O. Weimann, Random access to grammar-compressed strings, in: Proc. 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’11, pp. 373-389. · Zbl 1375.68229
[20] S. Kuruppu, B. Beresford-Smith, T. Conway, J. Zobel, Repetition-based compression of large DNA datasets, in: Proc. 13th Annual International Conference on Computational Molecular Biology, RECOMB’09, Poster.
[21] F. Claude, A. Fariña, M. Martínez-Prieto, G. Navarro, Compressed \(q\)-gram indexing for highly repetitive biological sequences, in: Proc. 10th IEEE Conference on Bioinformatics and Bioengineering, BIBE’10, pp. 86-91.
[22] F. Claude, A. Fariña, M. Martínez-Prieto, G. Navarro, Indexes for highly repetitive document collections, in: Proc. 20th ACM International Conference on Information and Knowledge Management, CIKM’11, pp. 463-468.
[23] K. Sadakane, R. Grossi, Squeezing succinct data structures into entropy bounds, in: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’06, pp. 1230-1239. · Zbl 1192.68188
[24] Navarro, G., Indexing text using the ziv – lempel trie, Journal of Discrete Algorithms, 2, 87-114, (2004) · Zbl 1118.68443
[25] Ferragina, P.; Manzini, G., Indexing compressed text, Journal of the ACM, 52, 552-581, (2005) · Zbl 1323.68261
[26] D. Arroyuelo, G. Navarro, K. Sadakane, Reducing the space requirement of LZ-index, Proc. 17th Annual Symposium on Combinatorial Pattern Matching, CPM, in: LNCS, vol. 4009, pp. 319-330. · Zbl 1196.68076
[27] Russo, L.; Oliveira, A., A compressed self-index using a ziv – lempel dictionary, Information Retrieval, 5, 501-513, (2008)
[28] S. Kuruppu, S. Puglisi, J. Zobel, Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval, in: Proc. 17th International Symposium on String Processing and Information Retrieval, SPIRE’10, pp. 201-206. · Zbl 1397.68073
[29] J. Sirén, N. Välimäki, V. Mäkinen, G. Navarro, Run-length compressed indexes are superior for highly repetitive sequence collections, in: Proc. 15th International Symposium on String Processing and Information Retrieval, SPIRE’08, in: LNCS, vol. 5280, pp. 164-175. · Zbl 1345.68124
[30] Mäkinen, V.; Navarro, G.; Sirén, J.; Välimäki, N., Storage and retrieval of highly repetitive sequence collections, Journal of Computational Biology, 17, 281-308, (2010)
[31] M. Burrows, D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, 1994.
[32] Grossi, R.; Vitter, J. S., Compressed suffix arrays and suffix trees with applications to text indexing and string matching, SIAM Journal of Computing, 35, 378-407, (2005) · Zbl 1092.68115
[33] Mäkinen, V.; Navarro, G., Succinct suffix arrays based on run-length encoding, Nordic Journal of Computing, 12, 40-66, (2005) · Zbl 1085.68031
[34] Sadakane, K., New text indexing functionalities of the compressed suffix arrays, Journal of Algorithms, 48, 294-313, (2003) · Zbl 1100.68563
[35] Rytter, W., Application of lempel – ziv factorization to the approximation of grammar-based compression, Theoretical Computer Science, 302, 211-222, (2003) · Zbl 1051.68088
[36] J. Kärkkäinen, E. Ukkonen, Lempel-Ziv parsing and sublinear-size index structures for string matching, in: Proc. 3rd South American Workshop on String Processing, WSP’96, pp. 141-155.
[37] J. Kärkkäinen, Repetition-based text indexes, Ph.D. Thesis, Department of Computer Science, University of Helsinki, Finland, 1999.
[38] S. Kreft, Self-Index based on LZ77, MSc Thesis, University of Chile, 2010. Available as Tech. Report TR/DCC-2011-13, Dept. of Computer Science, University of Chile. http://www.dcc.uchile.cl/TR/2011/DCC-20111220-013.pdf.
[39] M. Farach, M. Thorup, String matching in Lempel-Ziv compressed strings, in: Proc. 27th ACM Annual Symposium on the Theory of Computing, STOC, pp. 703-712. · Zbl 0978.68520
[40] Kosaraju, S. R.; Manzini, G., Compression of low entropy strings with lempel – ziv algorithms, SIAM Journal on Computing, 29, 893-911, (1999) · Zbl 0941.68055
[41] Plotnik, E.; Weinberger, M.; Ziv, J., Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the lempel – ziv algorithm, IEEE Transactions on Information Theory, 38, 66-72, (1992) · Zbl 0745.94004
[42] Hamming, R., Coding and information theory, (1986), Prentice-Hall · Zbl 0431.94001
[43] Fiala, E. R.; Greene, D. H., Data compression with finite windows, Communications of the ACM, 32, 490-505, (1989)
[44] R. Raman, V. Raman, S.S. Rao, Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets, in: Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’02, pp. 233-242. · Zbl 1093.68582
[45] J.I. Munro, R. Raman, V. Raman, S.S. Rao, Succinct representations of permutations, in: Proc. 30th International Colloquium on Automata, Languages and Computation, ICALP’03, in: LNCS, vol. 2719, pp. 345-356. · Zbl 1039.68546
[46] Morrison, D., PATRICIA—practical algorithm to retrieve information coded in alphanumeric, Journal of the ACM, 15, 514-534, (1968)
[47] Benoit, D.; Demaine, E.; Munro, I.; Raman, R.; Raman, V.; Rao, S., Representing trees of higher degree, Algorithmica, 43, 275-292, (2005) · Zbl 1086.68034
[48] Mäkinen, V.; Navarro, G., Rank and select revisited and extended, Theoretical Computer Science, 387, 332-347, (2007) · Zbl 1144.68023
[49] M. Paˇtraşcu, Succincter, in: Proc. 49th IEEE Annual Symposium on Foundations of Computer Science, FOCS’08, pp. 305-313.
[50] N. Brisaboa, S. Ladra, G. Navarro, Directly addressable variable-length codes, in: Proc. 16th International Symposium on String Processing and Information Retrieval, SPIRE’09, pp. 122-130.
[51] D. Arroyuelo, R. Cánovas, G. Navarro, K. Sadakane, Succinct trees in practice, in: Proc. 11th Workshop on Algorithm Engineering and Experiments, ALENEX’10, pp. 84-97.
[52] J. Fischer, Optimal succinctness for range minimum queries, in: Proceedings of the Latin American Symposium on Theoretical Informatics, LATIN’10, in: LNCS, vol. 6034, pp. 158-169. · Zbl 1283.68141
[53] S. Muthukrishnan, Efficient algorithms for document retrieval problems, in: Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’02, pp. 657-666. · Zbl 1093.68588
[54] S. Gog, J. Fischer, Advantages of shared data structures for sequences of balanced parentheses, in: Proc. 20th Data Compression Conference, DCC’10, pp. 406-415.
[55] Fischer, J.; Mäkinen, V.; Navarro, G., Faster entropy-bounded compressed suffix trees, Theoretical Computer Science, 410, 5354-5364, (2009) · Zbl 1187.68171
[56] M. Crochemore, C.S. Iliopoulos, M. Kubica, M.S. Rahman, T. Walen, Improved algorithms for the range next value problem and applications, in: Proc. 25th International Symposium on Theoretical Aspects of Computer Science, STACS’08, pp. 205-216. · Zbl 1259.68226
[57] D. Okanohara, K. Sadakane, Practical entropy-compressed rank/select dictionary, in: Proc. 9th Workshop on Algorithm Engineering and Experiments, ALENEX’07.
[58] Chen, G.; Puglisi, S. J.; Smyth, W. F., Lempel-Ziv factorization using less time & space, Mathematics in Computer Science, 1, 605-623, (2008) · Zbl 1181.68315
[59] J. Kärkkäinen, P. Sanders, Simple linear work suffix array construction, in: Proc. 30th International Colloquium on Automata, Languages and Programming, ICALP’03, in: LNCS, vol. 2719, pp. 943-955. · Zbl 1039.68042
[60] Willard, D., Log – logarithmic worst-case range queries are possible in space \(\Theta(n)\), Information Processing Letters, 17, 81-84, (1983) · Zbl 0509.68106
[61] G. Navarro, K. Sadakane, Fully-functional static and dynamic succinct trees, CoRR 0905.0768v5 (2010). · Zbl 1288.05046
[62] Munro, J. I., An implicit data structure supporting insertion, deletion, and search in \(O(\log n)\) time, Journal of Computer System Sciences, 33, 66-74, (1986) · Zbl 0625.68044
[63] Arroyuelo, D.; Navarro, G.; Sadakane, K., Stronger lempel – ziv based compressed text indexing, Algorithmica, 62, 1, 54-101, (2012) · Zbl 1241.68061
[64] D. Okanohara, K. Sadakane, An online algorithm for finding the longest previous factors, in: Proc. 16th Annual European Symposium on Algorithms, ESA’08, pp. 696-707. · Zbl 1158.68556
[65] Navarro, G., Implementing the LZ-index: theory versus practice, ACM Journal of Experimental Algorithmics, 13, (2009), (article 2) · Zbl 1284.68258
[66] J. Fischer, V. Heun, A new succinct representation of RMQ-information and improvements in the enhanced suffix array, in: Proc. 1st International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, ESCAPE’07, in: LNCS, vol. 4614, pp. 459-470. · Zbl 1176.68058
[67] Claude, F.; Navarro, G., Self-indexed grammar-based compression, Fundamenta Informaticae, 111, 313-337, (2010) · Zbl 1237.68072
[68] Hon, W.-K.; Sadakane, K.; Sung, W.-K., Breaking a time-and-space barrier in constructing full-text indices, SIAM Journal on Computing, 38, 2162-2178, (2009) · Zbl 1191.68225
[69] L.M.S. Russo, G. Navarro, A.L. Oliveira, Fully-compressed suffix trees, in: 8th Latin American Symposium on Theoretical Informatics, LATIN’08, in: LNCS, vol. 4957, pp. 362-373. · Zbl 1136.68369
[70] D. Arroyuelo, G. Navarro, Smaller and faster Lempel-Ziv indices, in: Proc. 18th International Workshop on Combinatorial Algorithms, IWOCA’07, pp. 11-20.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.