# zbMATH — the first resource for mathematics

A metric index for approximate string matching. (English) Zbl 1087.68027
Summary: We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the $$occ$$ occurrences of a pattern of length $$m$$, permitting up to $$r$$ differences, in a text of length $$n$$ over an alphabet of size $$\sigma$$, in average time $$O(m^{1+ \varepsilon}+occ)$$ for any $$\varepsilon>0$$, if $$r=o(m/\log_\sigma m)$$ and $$m>((1 +\varepsilon)/ \varepsilon)\log_\sigma n$$. The index works well up to $$r<(3-\sqrt 2)m/ \log_\sigma m$$, where it achieves its maximum average search complexity $$O(m^{1+\sqrt 2+\varepsilon}+occ)$$. The construction time of the index is $$O(m^{1+ \sqrt 2+\varepsilon}n\log n)$$ and its space is $$O(m^{1+\sqrt 2+\varepsilon}n)$$. This is the first index achieving average search time polynomial in $$m$$ and independent of $$n$$, for $$r=O (m/\log_\sigma m)$$. Previous methods achieve this complexity only for $$r=O(m/\log_\sigma n)$$. We also present a simpler scheme needing $$O(n)$$ space.

##### MSC:
 68P10 Searching and sorting
##### Software:
GLIMPSE; PATRICIA
Full Text:
##### References:
 [1] Apostolico, A., The myriad virtues of subword trees, (), 85-96 · Zbl 0572.68067 [2] Apostolico, A.; Galil, Z., Combinatorial algorithms on words, (1985), Springer New York [3] Baeza-Yates, R.; Navarro, G., Fast approximate string matching in a dictionary, (), 14-22 [4] Baeza-Yates, R.; Navarro, G., A practical $$q$$-Gram index for text retrieval allowing errors, CLEI electron. J., 1, 2, (1998), $$\langle$$⟩ [5] Baeza-Yates, R.; Navarro, G., Block-addressing indices for approximate text retrieval, J. amer. soc. inform. sci., 51, 1, 69-82, (2000) [6] Baeza-Yates, R.; Navarro, G., A hybrid indexing method for approximate string matching, J. discrete algorithms, 1, 1, 205-239, (2000) [7] Beckmann, N.; Kriegel, H.; Schneider, R.; Seeger, B., The $$\operatorname{R}^*$$-tree: an efficient and robust access method for points and rectangles, (), 322-331 [8] Böhm, C.; Berchtold, S.; Keim, D., Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases, ACM comp. surveys, 33, 3, 322-373, (2001) [9] A. Buchsbaum, M. Goodrich, J. Westbrook, Range searching over tree cross products, in: Proc. ESA’00, Lecture Notes in Computer Science, Vol. 1879, Springer, Berlin, 2000, pp. 120-131. · Zbl 0974.68510 [10] Bugnion, E.; Roos, T.; Shi, F.; Widmayer, P.; Widmer, F., Approximate multiple string matching using spatial indexes, (), 43-54 [11] E. Chávez, G. Navarro, A metric index for approximate string matching, in: Proc. LATIN’02, Lecture Notes in Computer Science, Vol. 2286, 2002, pp. 181-195. · Zbl 1059.68637 [12] Chávez, E.; Navarro, G., A compact space decomposition for effective metric indexing, Pattern recognition lett., 26, 9, 1363-1376, (2005) [13] Chávez, E.; Navarro, G.; Baeza-Yates, R.; Marroquín, J., Proximity searching in metric spaces, ACM comp. surveys, 33, 3, 273-321, (2001) [14] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. CPM’95, Lecture Notes in Computer Science, Vol. 937, Springer, Berlin, 1995, pp. 41-54. [15] Cole, R.; Gottlieb, L.; Lewenstein, M., Dictionary matching and indexing with errors and don’t cares, (), 91-100 · Zbl 1192.68818 [16] Crochemore, M., Transducers and repetitions, Theoret. comput. sci., 45, 63-86, (1986) · Zbl 0615.68053 [17] Faloutsos, C.; Kamel, I., Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension, (), 4-13 [18] K. Fredriksson, Metric indexes for approximate string matching in a dictionary, in: Proc. SPIRE’04, Lecture Notes in Computer Science, Vol. 3246, 2004, pp. 212-213. · Zbl 1111.68756 [19] Gaede, V.; Günther, O., Multidimensional access methods, ACM comp. surveys, 30, 2, 170-231, (1998) [20] G. Gonnet, A tutorial introduction to Computational Biochemistry using Darwin, Technical Report, Informatik E.T.H., Zuerich, Switzerland, 1992. [21] Gusfield, D., Algorithms on strings trees and sequences: computer science and computational biology, (1997), Cambridge University Press Cambridge · Zbl 0934.68103 [22] Guttman, A., R-trees: a dynamic index structure for spatial searching, (), 47-57 [23] Hjaltason, G.; Samet, H., Index-driven similarity search in metric spaces, ACM trans. database systems, 28, 4, 517-580, (2003) [24] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Vol. 16, 1991, pp. 240-248. · Zbl 0776.68047 [25] Kamel, I.; Faloutsos, C., On packing R-trees, (), 490-499 [26] J. Kärkkäinen, P. Sanders, Simple linear work suffix array construction, in: Proc. ICALP’03, Lecture Notes in Computer Science, Vol. 2719, 2003, pp. 943-955. · Zbl 1039.68042 [27] D. Kim, J. Sim, H. Park, K. Park. Linear-time construction of suffix arrays, in: Proc. CPM’03, Lecture Notes in Computer Science, Vol. 2676, 2003, pp. 186-199. · Zbl 1279.68068 [28] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: Proc. CPM’03, Lecture Notes in Computer Science, Vol. 2676, 2003, pp. 200-210. · Zbl 1279.68069 [29] Levenshtein, V., Binary codes capable of correcting spurious insertions and deletions of ones, Problems inform. transmission, 1, 8-17, (1965) · Zbl 0149.15905 [30] M. Maas, J. Nowak, A new method for approximate indexing and dictionary lookup with one error, Inform. Process. Lett. 96 (5) (2005) 185-191. · Zbl 1184.68211 [31] U. Manber, E. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. (1993) 935-948. · Zbl 0784.68027 [32] U. Manber, S. Wu, \scGLIMPSE: a tool to search through entire file systems, in: Proc. USENIX Technical Conference, Winter 1994, pp. 23-32. [33] McCreight, E., A space-economical suffix tree construction algorithm, J. ACM, 23, 2, 262-272, (1976) · Zbl 0329.68042 [34] Morrison, D., PATRICIA—practical algorithm to retrieve information coded in alphanumeric, J. ACM, 15, 4, 514-534, (1968) [35] Myers, E., A sublinear algorithm for approximate keyword searching, Algorithmica, 12, 4/5, 345-374, (1994) · Zbl 0941.68560 [36] Navarro, G., A guided tour to approximate string matching, ACM comp. surveys, 33, 1, 31-88, (2001) [37] Navarro, G.; Baeza-Yates, R.; Sutinen, E.; Tarhio, J., Indexing methods for approximate string matching, IEEE data eng. bull., 24, 4, 19-27, (2001) [38] Navarro, G.; Sutinen, E.; Tarhio, J., Indexing text with approximate $$q$$-grams, J. discrete algorithms, 3, 2-4, 157-175, (2005) · Zbl 1101.68508 [39] Pagel, B.; Six, H.; Toben, H.; Widmayer, P., Towards an analysis of range queries, (), 221-241 [40] Papadias, D.; Theodoridis, Y.; Stefanakis, E., Multidimensional range queries with spatial relations, Geographical systems, 4, 4, 343-365, (1997) [41] Robinson, J., The K-D-B-tree: a search structure for large multidimensional dynamic indexes, (), 10-18 [42] Sedgewick, R.; Flajolet, P., Analysis of algorithms, (1996), Addison-Wesley Reading, MA [43] Shi, F., Fast approximate string matching with $$q$$-blocks sequences, (), 257-271 [44] E. Sutinenm, J. Tarhio, Filtration with $$q$$-samples in approximate string matching, in: Proc. CPM’96, Lecture Notes in Computer Science, Vol. 1075, Springer, Berlin, 1996, pp. 50-61. [45] W. Szpankowski, Probabilistic analysis of generalized suffix trees, in: Proc. CPM’92, Lecture Notes in Computer Science, Vol. 644, Springer, Berlin, 1992, pp. 1-14. [46] Thedoridis, Y.; Sellis, T., A model for the prediction of R-tree performance, (), 161-171 [47] Ukkonen, E., Approximate string matching over suffix trees, (), 228-242 [48] Ukkonen, E., Constructing suffix trees on-line in linear time, Algorithmica, 14, 3, 249-260, (1995) · Zbl 0831.68027 [49] D. White, R. Jain, Algorithms and strategies for similarity retrieval, Technical Report VCL-96-101, Visual Comp. Lab., University of California, July 1996.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.