zbMATH — the first resource for mathematics

A Lempel-Ziv text index on secondary storage. (English) Zbl 1138.68381
Ma, Bin (ed.) et al., Combinatorial pattern matching. 18th annual symposium, CPM 2007, London, Canada, July 9–11, 2007. Proceedings. Berlin: Springer (ISBN 978-3-540-73436-9/pbk). Lecture Notes in Computer Science 4580, 83-94 (2007).
Summary: Full-text searching consists in locating the occurrences of a given pattern \(P[1..m]\) in a text \(T[1..u]\), both sequences over an alphabet of size \(\sigma \). In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring \(8uH _{k } + o(u \log \sigma )\) bits of space, where \(H _{k }\) denotes the \(k\)-th order empirical entropy of \(T\), for any \(k = o(\log _{\sigma } u)\). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size including the text, which means 39%–65% the size of traditional indexes like String B-trees [P. Ferragina and R. Grossi, “The string B-tree: a new data structure for string search in external memory and its applications”, J. ACM 46, No. 2, 236–280 (1999; Zbl 1065.68518)]. In exchange, our index requires more disk access to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.
For the entire collection see [Zbl 1136.68008].

68P10 Searching and sorting
68P05 Data structures
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
Full Text: DOI