×

A theoretical and experimental study on the construction of suffix arrays in external memory. (English) Zbl 0995.68032

Summary: The construction of full-text indexes on very large text collections is nowadays a hot problem. The suffix array is one of the most attractive full-text indexing data structures due to its simplicity, space efficiency and powerful/fast search operations supported. In this paper we analyze, both theoretically and experimentally, the I/O complexity and the working space of six algorithms for constructing large suffix arrays. Three of them are state-of-the-art, the other three algorithms are our new proposals. We perform a set of experiments based on three different data sets (English texts, amino-acid sequences and random texts) and give a precise hierarchy of these algorithms according to their working-space versus construction-time tradeoff. Given the current trends in model design and disk technology, we pose particular attention to differentiate between “random” and “contiguous” disk accesses, in order to explain reasonably some practical I/O phenomena which are related to the experimental behavior of these algorithms and that would otherwise be meaningless in the light of other simpler external-memory models.
We also address two other issues. The former is concerned with the problem of building word indexes; we show that our results can be successfully applied to this case too, without any loss in efficiency and without compromising the simplicity of programming to achieve a uniform, simple and efficient approach to both the two indexing models. The latter issue is related to the intriguing and apparently counterintuitive “contradiction” between the effective practical performance of the well-known Baeza-Yates-Gonnet-Snider algorithm, verified in our experiments, and its unappealing worst-case behavior. We devise a new external-memory algorithm that follows the basic philosophy underlying that algorithm but in a significantly different manner, thus resulting in a novel approach which combines good worst-case bounds with efficient practical performance.

MSC:

68P10 Searching and sorting
68W05 Nonnumerical algorithms

Software:

LEDA
PDFBibTeX XMLCite
Full Text: DOI Link