×

zbMATH — the first resource for mathematics

An experimental study of a compressed index. (English) Zbl 1031.68536
Summary: We present an implementation of that index and perform an extensive set of experiments on various text collections. These experiments show that the proposed index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e., in case of a selective query). In addition, our experiments show that the index is flexible in that it is possible to trade space occupancy for search time by choosing the amount of auxiliary information stored into it.

MSC:
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
Software:
Canterbury; GLIMPSE
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Amir, A.; Benson, G.; Farach, M., Let sleeping files Lie: pattern matching in Z-compressed files, J. comput. syst. sci., 52, 2, 299-307, (1996) · Zbl 1152.68436
[2] R. Arnold, T. Bell, The Canterbury corpus home page, http://corpus.canterbury.ac.nz
[3] Baeza-Yates, R.; Navarro, G., Block addressing indices for approximate text retrieval, J. amer. soc. inf. sci., 51, 1, 69-82, (2000)
[4] Bentley, J.; Sleator, D.; Tarjan, R.; Wei, V., A locally adaptive compression scheme, Commun. ACM, 29, 4, 320-330, (1986) · Zbl 0648.94007
[5] J.L. Bentley, R. Sedgewick, Fast algorithms for sorting and searching strings, in: Proceedings of the Eighth ACM-SIAM Symposium on Discrete Algorithms, 1997, pp. 360-369 · Zbl 1321.68549
[6] M. Burrows, D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation, 1994
[7] M. Crochemore, F. Mignosi, A. Restivo, S. Salemi, Text compression using antidictionaries, in: Proceedings of International Colloquium on Automata and Languages (ICALP ’99), Lecture Notes in Computer Science, vol. 1644, Springer, Berlin, 1999, pp. 261-270
[8] Farach, M.; Thorup, M., String matching in Lempel-Ziv compressed strings, Algorithmica, 20, 4, 388-404, (1998) · Zbl 0899.68046
[9] Fenwick, P., The burrows – wheeler transform for block sorting text compression: principles and improvements, Comput. J., 39, 9, 731-740, (1996)
[10] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, Redondo Beach, CA, 2000, pp. 390-398
[11] G.H. Gonnet, R.A. Baeza-Yates, T. Snider, Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1992, pp. 66-82 (Chapter 5)
[12] R. Grossi, J. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proceedings of the 32nd ACM Symposium on Theory of Computing, Portland, OR, 2000, pp. 397-406 · Zbl 1296.68035
[13] D.K. Harman (Ed.), Proceedings of the TREC Text Retrieval Conference, National Institute of Standards, 1992. Special Pubblication 500-207
[14] D.E. Knuth, Sorting and Searching. The Art of Computer Programming, vol. 3, 2nd ed., Addison-Wesley, Reading, MA, USA, 1998 · Zbl 0883.68015
[15] V. Makinen, Compact suffix array, in: Proceedings of the 11th Symposium on Combinatorial Pattern Matching, Lecuter Notes in Computer Science, vol. 1848, Springer, Berlin, 2000, pp. 305-319 · Zbl 0964.68511
[16] Manber, U.; Myers, G., Suffix arrays: a new method for on-line string searches, SIAM J. comput., 22, 5, 935-948, (1993) · Zbl 0784.68027
[17] U. Manber, S. Wu, GLIMPSE: A tool to search through entire file systems, in: Proceedings of the USENIX Winter 1994 Technical Conference, 1994, pp. 23-32
[18] E. Moura, G. Navarro, N. Ziviani, Indexing compressed text, in: N. Ziviani, R. Baeza-Yates, K. Guimar√£es (Eds.), Proceedings of the Fourth South American Workshop on String Processing, Carleton University Press, 1997
[19] E. Moura, G. Navarro, N. Ziviani, R. Baeza-Yates, Fast searching on compressed text allowing errors, in: Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval 1998, 298-306
[20] K. Sadakane, Compressed text databases with efficient query algorithms based on the compressed suffix array, in: Proceeding of the 11th International Symposium on Algorithms and Computation (ISAAC ’00), Lecture Notes in Computer Science, Springer, Berlin, 2000 · Zbl 1044.68587
[21] J. Seward, The {\scbzip2} home page, 1997, http://sourceware.cygnus.com/bzip2/index.html
[22] I.H. Witten, A. Moffat, T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, second ed., Morgan Kaufmann, Los Altos, CA 94022, USA, 1999 · Zbl 0821.68051
[23] Witten, I.H.; Neal, R.M.; Cleary, J.G., Arithmetic coding for data compression, Commun. ACM, 30, 6, 520-540, (1987)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.