×

zbMATH — the first resource for mathematics

Self-indexing based on LZ77. (English) Zbl 1339.68334
Giancarlo, Raffaele (ed.) et al., Combinatorial pattern matching. 22nd annual symposium, CPM 2011, Palermo, Italy, June 27–29, 2011. Proceedings. Berlin: Springer (ISBN 978-3-642-21457-8/pbk). Lecture Notes in Computer Science 6661, 41-54 (2011).
Summary: We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1–2 million characters of the text per second, and finds patterns at a rate of 10–50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
For the entire collection see [Zbl 1216.68028].

MSC:
68W32 Algorithms on strings
68P15 Database theory
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
92D10 Genetics and epigenetics
Software:
PATRICIA
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: ALENEX, pp. 84–97 (2010) · doi:10.1137/1.9781611972900.9
[2] Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., Rao, S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005) · Zbl 1086.68034 · doi:10.1007/s00453-004-1146-6
[3] Brisaboa, N.R., Ladra, S., Navarro, G.: Directly addressable variable-length codes. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 122–130. Springer, Heidelberg (2009) · Zbl 05609205 · doi:10.1007/978-3-642-03784-9_12
[4] Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. TRep. 124, DEC (1994)
[5] Claude, F., Fariña, A., Martínez-Prieto, M., Navarro, G.: Compressed q-gram indexing for highly repetitive biological sequences. In: BIBE, pp. 86–91 (2010) · doi:10.1109/BIBE.2010.22
[6] Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005) · Zbl 1323.68261 · doi:10.1145/1082036.1082039
[7] Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Alg. 3(2), article 20 (2007) · Zbl 1321.68263
[8] Fischer, J.: Optimal succinctness for range minimum queries. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 158–169. Springer, Heidelberg (2010) · Zbl 1283.68141 · doi:10.1007/978-3-642-12200-2_16
[9] Gog, S., Fischer, J.: Advantages of shared data structures for sequences of balanced parentheses. In: DCC, pp. 406–415 (2010) · doi:10.1109/DCC.2010.43
[10] Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: SODA, pp. 841–850 (2003) · Zbl 1092.68584
[11] Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: STOC, pp. 397–406 (2000) · Zbl 1296.68035 · doi:10.1145/335305.335351
[12] He, J., Zeng, J., Suel, T.: Improved index compression techniques for versioned document collections. In: CIKM, pp. 1239–1248 (2010) · doi:10.1145/1871437.1871594
[13] Kärkkäinen, J.: Repetition-Based Text Indexes. Ph.D. thesis, Univ. Helsinki, Finland (1999) · Zbl 0940.68063
[14] Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: WSP, pp. 141–155 (1996)
[15] Kreft, S.: Self-Index based on LZ77. MSc thesis, Univ. of Chile (2010), http://www.dcc.uchile.cl/gnavarro/algoritmos/tesisKreft.pdf · Zbl 1339.68334
[16] Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: DCC, pp. 239–248 (2010) · doi:10.1109/DCC.2010.29
[17] Kuruppu, S., Beresford-Smith, B., Conway, T., Zobel, J.: Repetition-based compression of large DNA datasets. In: RECOMB (2009), poster
[18] Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative lempel-ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 201–206. Springer, Heidelberg (2010) · Zbl 1397.68073 · doi:10.1007/978-3-642-16321-0_20
[19] Mäkinen, V., Navarro, G.: Rank and select revisited and extended. Theo.Comp.Sci. 387(3), 332–347 (2007) · Zbl 1144.68023 · doi:10.1016/j.tcs.2007.07.013
[20] Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comp. Biol. 17(3), 281–308 (2010) · doi:10.1089/cmb.2009.0169
[21] Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001) · Zbl 1323.68262 · doi:10.1145/382780.382782
[22] Morrison, D.: PATRICIA-Practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514–534 (1968) · doi:10.1145/321479.321481
[23] Munro, I., Raman, R., Raman, V., Rao, S.: Succinct representations of permutations. In: ICALP, pp. 345–356 (2003) · Zbl 1039.68546 · doi:10.1007/3-540-45061-0_29
[24] Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: SODA, pp. 657–666 (2002) · Zbl 1093.68588
[25] Navarro, G.: Indexing text using the Ziv-Lempel trie. J. Discr. Alg. 2(1), 87–114 (2004) · Zbl 1118.68443 · doi:10.1016/S1570-8667(03)00066-2
[26] Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comp. Surv. 39(1), article 2 (2007) · Zbl 1321.68263
[27] Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: ALENEX (2007) · doi:10.1137/1.9781611972870.6
[28] Pǎtraşcu, M.: Succincter. In: FOCS, pp. 305–313 (2008)
[29] Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: SODA, pp. 233–242 (2002) · Zbl 1093.68582
[30] Russo, L., Oliveira, A.: A compressed self-index using a Ziv-Lempel dictionary. Inf. Retr. 5(3), 501–513 (2008)
[31] Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Alg. 48(2), 294–313 (2003) · Zbl 1100.68563 · doi:10.1016/S0196-6774(03)00087-7
[32] Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008) · Zbl 1345.68124 · doi:10.1007/978-3-540-89097-3_17
[33] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theo. 23(3), 337–343 (1977) · Zbl 0379.94010 · doi:10.1109/TIT.1977.1055714
[34] Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theo. 24(5), 530–536 (1978) · Zbl 0392.94004 · doi:10.1109/TIT.1978.1055934
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.