×

Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. (English) Zbl 1305.68078


MSC:

68P20 Information storage and retrieval of data
68P05 Data structures
68U05 Computer graphics; computational geometry (digital and algorithmic aspects)
68W32 Algorithms on strings
68-02 Research exposition (monographs, survey articles) pertaining to computer science
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] A. Aho, J. Hopcroft, and J. Ullman. 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley. · Zbl 0326.68005
[2] A. Apostolico. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, 85–96. · Zbl 0572.68067 · doi:10.1007/978-3-642-82456-2_6
[3] R. Baeza-Yates and B. Ribeiro-Neto. 2011. Modern Information Retrieval, 2nd ed. Addison-Wesley.
[4] J. Barbay, F. Claude, T. Gagie, G. Navarro, and Y. Nekrich. 2014. Efficient fully-compressed sequence representations. Algorithmica 69, 1, 232–268. · Zbl 1307.68029 · doi:10.1007/s00453-012-9726-3
[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. 2009. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics 14, art. 7. · Zbl 1284.68222
[6] A. Bartsch, B. Bunk, I. Haddad, J. Klein, R. Münch, T. Johl, U. Kärst, L. Jänsch, D. Jahn, and I. Retter. 2011. GeneReporter – sequence-based document retrieval and annotation. Bioinformatics 27, 7, 1034–1035.
[7] H. Bast, C. Mortensen, and I. Weber. 2008. Output-sensitive autocompletion search. Information Retrieval 11, 4, 269–286. · Zbl 05343845 · doi:10.1007/s10791-008-9048-x
[8] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. 2009. Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 785–794. · doi:10.1137/1.9781611973068.86
[9] D. Belazzougui and G. Navarro. 2011. Alphabet-independent compressed text indexing. In Proc. 19th Annual European Symposium on Algorithms (ESA). LNCS 6942. 748–759. · Zbl 1325.68307
[10] D. Belazzougui and G. Navarro. 2012. New lower and upper bounds for representing sequences. In Proc. 20th Annual European Symposium on Algorithms (ESA). LNCS 7501. 181–192. · Zbl 1365.68260
[11] D. Belazzougui, G. Navarro, and D. Valenzuela. 2013. Improved compressed indexes for full-text document retrieval. Journal of Discrete Algorithms 18, 3–13. · Zbl 1268.68075 · doi:10.1016/j.jda.2012.07.005
[12] M. Bender and M. Farach-Colton. 2000. The LCA problem revisited. In Proc. 4th Latin American Theoretical Informatics Symposium (LATIN). LNCS 1776. 88–94. · Zbl 0959.68133
[13] O. Berkman and U. Vishkin. 1993. Recursive star-tree parallel data structure. SIAM Journal on Computing 22, 2, 221–242. · Zbl 0770.68044 · doi:10.1137/0222017
[14] P. Bose, M. He, A. Maheshwari, and P. Morin. 2009. Succinct orthogonal range search structures on a grid with applications to text indexing. In Proc. 20th Symposium on Algorithms and Data Structures (WADS). LNCS 5664. 98–109. · Zbl 1253.68103
[15] G. Brodal, B. Gfeller, A. Jørgensen, and P. Sanders. 2011. Towards optimal range medians. Theoretical Computer Science 412, 24, 2588–2601. · Zbl 1220.68052
[16] F. Brown. 2005. Editorial opinion: Chemoinformatics—a ten year update. Current Opinion in Drug Discovery & Development 8, 3, 296–302.
[17] S. Büttcher, C. Clarke, and G. Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. · Zbl 1211.68176
[18] M. Celikik and H. Bast. 2009. Fast error-tolerant search on very large texts. In Proc. 24th ACM Symposium on Applied Computing (SAC). 1724–1731.
[19] T. Chan, S. Durocher, K. Larsen, J. Morrison, and B. Wilkinson. 2012. Linear-space data structures for range mode query in arrays. In Proc. 29th International Symposium on Theoretical Aspects of Computer Science (STACS). 290–301. · Zbl 1245.68071
[20] D. Clark. 1996. Compact PAT trees. Ph.D. thesis, University of Waterloo, Canada.
[21] H. Cohen and E. Porat. 2010. Fast set intersection and two-patterns matching. Theoretical Computer Science 411, 40–42, 3795–3800. · Zbl 1207.68270 · doi:10.1016/j.tcs.2010.06.002
[22] M. Crochemore and W. Rytter. 2002. Jewels of Stringology. World Scientific. · Zbl 1078.68151 · doi:10.1142/4838
[23] B. Croft, D. Metzler, and T. Strohman. 2009. Search Engines: Information Retrieval in Practice. Pearson Education.
[24] S. Culpepper, G. Navarro, S. Puglisi, and A. Turpin. 2010. Top-k ranked document search in general text databases. In Proc. 18th Annual European Symposium on Algorithms (ESA). LNCS 6347. 194–205 (part II). · Zbl 1287.68035
[25] S. Culpepper, M. Petri, and F. Scholer. 2012. Efficient in-memory top-k document retrieval. In Proc. 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 225–234.
[26] M. Farach. 1997. Optimal suffix tree construction with large alphabets. In Proc. 38th IEEE Symposium on Foundations of Computer Science (FOCS). 137–143. · doi:10.1109/SFCS.1997.646102
[27] A. Fariña, N. Brisaboa, G. Navarro, F. Claude, A. Places, and E. Rodríguez. 2012. Word-based self-indexes for natural language text. ACM Transactions on Information Systems 30, 1, art. 1.
[28] H. Ferrada and G. Navarro. 2013. A Lempel-Ziv compressed structure for document listing. In Proc. 20th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 8214. 116–128.
[29] P. Ferragina, R. González, G. Navarro, and R. Venturini. 2009. Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics 13, art. 12. · Zbl 1284.68255
[30] P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. 2003. Two-dimensional substring indexing. Journal of Computer and System Sciences 66, 4, 763–774. · Zbl 1054.68043 · doi:10.1016/S0022-0000(03)00028-X
[31] J. Fischer, T. Gagie, T. Kopelowitz, M. Lewenstein, V. Mäkinen, L. Salmela, and N. Välimäki. 2012. Forbidden patterns. In Proc. 10th Latin American Symposium on Theoretical Informatics (LATIN). LNCS 7256. 327–337.
[32] J. Fischer and V. Heun. 2011. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing 40, 2, 465–492. · Zbl 1222.05024 · doi:10.1137/090779759
[33] H. Gabow, J. Bentley, and R. Tarjan. 1984. Scaling and related techniques for geometry problems. In Proc. 16th ACM Symposium on Theory of Computing (STOC). 135–143.
[34] T. Gagie, J. Kärkkäinen, G. Navarro, and S. Puglisi. 2013. Colored range queries and document retrieval. Theoretical Computer Science 483, 36–50. · Zbl 1292.68045
[35] T. Gagie, G. Navarro, and S. J. Puglisi. 2012. New algorithms on wavelet trees and applications to information retrieval. Theoretical Computer Science 426–427, 25–41. · Zbl 1243.68161
[36] T. Gagie, S. J. Puglisi, and A. Turpin. 2009. Range quantile queries: Another virtue of wavelet trees. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 5721. 1–6.
[37] A. Golynski, I. Munro, and S. Rao. 2006. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 368–373. · Zbl 1192.68800
[38] G. Gonnet, R. Baeza-Yates, and T. Snider. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 66–82.
[39] R. González and G. Navarro. 2007. Compressed text indexes with fast locate. In Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 216–227. · Zbl 1138.68415
[40] M. Greve, A. G. Jørgensen, K. D. Larsen, and J. Truelsen. 2010. Cell probe lower bounds and approximations for range mode. In Proc. 37th International Colloquim on Automata, Languages and Programming (ICALP). 605–616. · Zbl 1288.68046
[41] R. Grossi, A. Gupta, and J. Vitter. 2003. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 841–850. · Zbl 1092.68584
[42] R. Grossi, A. Orlandi, and R. Raman. 2010. Optimal trade-offs for succinct string indexes. In Proc. 37th International Colloquim on Automata, Languages and Programming (ICALP). 678–689. · Zbl 1288.68047
[43] P. Gupta, R. Janardan, and M. Smid. 1995. Further results on generalized intersection searching problems: Counting, reporting, and dynamization. Journal of Algorithms 19, 2, 282–317. · Zbl 0839.68106 · doi:10.1006/jagm.1995.1038
[44] D. Gusfield. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. · Zbl 0934.68103 · doi:10.1017/CBO9780511574931
[45] D. Harel and R. Tarjan. 1984. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing 13, 2, 338–355. · Zbl 0535.68022 · doi:10.1137/0213024
[46] H. Heaps. 1978. Information Retrieval: Theoretical and Computational Aspects. Academic Press. · Zbl 0471.68075
[47] W.-K. Hon, M. Patil, R. Shah, and S.-B. Wu. 2010a. Efficient index for retrieving top-k most frequent documents. Journal of Discrete Algorithms 8, 4, 402–417. · Zbl 1215.68095 · doi:10.1016/j.jda.2010.08.003
[48] W.-K. Hon, R. Shah, and S. Thankachan. 2012a. Towards an optimal space-and-query-time index for top-k document retrieval. In Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 7354. 173–184. · Zbl 1358.68092
[49] W.-K. Hon, R. Shah, S. Thankachan, and J. Vitter. 2010b. String retrieval for multi-pattern queries. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 6393. 55–66.
[50] W.-K. Hon, R. Shah, S. Thankachan, and J. Vitter. 2012b. Document listing for queries with excluded pattern. In Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 7354. 185–195. · Zbl 1358.68093
[51] W.-K. Hon, R. Shah, S. Thankachan, and J. Vitter. 2013. Faster compressed top-k document retrieval. In Proc. 23rd Data Compression Conference (DCC). 341–350.
[52] W.-K. Hon, R. Shah, and J. Vitter. 2009. Space-efficient framework for top-kstring retrieval problems. In Proc. 50th IEEE Annual Symposium on Foundations of Computer Science (FOCS). 713–722. · Zbl 1292.68182
[53] W.-K. Hon, R. Shah, and J. S. Vitter. 2010c. Compression, indexing, and retrieval for massive string data. In Proc. 21st Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 6129. 260–274. · Zbl 1286.68118
[54] P. Hsu and G. Ottaviano. 2013. Space-efficient data structures for top-k completion. In Proc. 22nd World Wide Web Conference (WWW). 583–594. · doi:10.1145/2488388.2488440
[55] L. Hui. 1992. Color set size problem with applications to string matching. In Proc. 3rd Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 644. 227–240.
[56] R. Janardan and M. Lopez. 1993. Generalized intersection searching problems. International Journal of Computational Geometry Applications 3, 1, 39–69. · Zbl 0777.68078 · doi:10.1142/S021819599300004X
[57] A. Jørgensen and K. Larsen. 2011. Range selection and median: Tight cell probe lower bounds and adaptive data structures. In Proc. 22nd Symposium on Discrete Algorithms (SODA). 805–813. · Zbl 1373.68196
[58] J. Kärkkäinen, P. Sanders, and S. Burkhardt. 2006. Linear work suffix array construction. Journal of the ACM 53, 6, 918–936. · Zbl 1326.68111
[59] M. Karpinski and Y. Nekrich. 2011. Top-k color queries for document retrieval. In Proc. 22nd Symposium on Discrete Algorithms (SODA). 401–411. · Zbl 1373.68197
[60] D. Kim, J. Sim, H. Park, and K. Park. 2005. Constructing suffix arrays in linear time. Journal of DiscreteAlgorithms 3, 2–4, 126–142. · Zbl 1101.68505 · doi:10.1016/j.jda.2004.08.019
[61] P. Ko and S. Aluru. 2005. Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms 3, 2–4, 143–156. · Zbl 1101.68506
[62] R. Konow and G. Navarro. 2013. Faster compact top-k document retrieval. In Proc. 23rd Data Compression Conference (DCC). 351–360.
[63] K. Larsen and F. van Walderveen. 2013. Near-optimal range reporting structures for categorical data. In Proc. 24th Symposium on Discrete Algorithms (SODA). 265–276.
[64] M. Lewenstein. 2013. Orthogonal range searching for text indexing. CoRR arXiv:1306.0615. · Zbl 1394.68099
[65] E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, and P. Baldi. 2009. Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery 18, 2, 300–336. · Zbl 05659239 · doi:10.1007/s10618-008-0118-x
[66] B. Liu. 2007. Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer. · Zbl 1138.68322
[67] C. Makris. 2012. Wavelet trees: a survey. Computer Science and Information Systems 9, 2, 585–625. · doi:10.2298/CSIS110606004M
[68] U. Manber and G. Myers. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5, 935–948. · Zbl 0784.68027 · doi:10.1137/0222058
[69] G. Manzini. 2001. An analysis of the Burrows-Wheeler transform. Journal of the ACM 48, 3, 407–430. · Zbl 1323.68262 · doi:10.1145/382780.382782
[70] Y. Matias, S. Muthukrishnan, S. Sahinalp, and J. Ziv. 1998. Augmenting suffix trees, with applications. In Proc. 6th European Symposium on Algorithms (ESA). LNCS 1461. 67–78.
[71] E. McCreight. 1976. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 2, 262–272. · Zbl 0329.68042 · doi:10.1145/321941.321946
[72] K. Mehlhorn. 1984. Data Structures and Algorithms 1: Sorting and Searching. EATCS Monographs on Theoretical Computer Science. Springer-Verlag. · Zbl 0556.68001 · doi:10.1007/978-3-642-69672-5
[73] I. Munro. 1996. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). LNCS 1180. 37–42.
[74] S. Muthukrishnan. 2002. Efficient algorithms for document retrieval problems. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 657–666. · Zbl 1093.68588
[75] G. Navarro. 2012. Wavelet trees for all. In Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 7354. 2–26. · Zbl 1358.68081
[76] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. 2001. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24, 4, 19–27.
[77] G. Navarro and V. Mäkinen. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, Article 2.
[78] G. Navarro and Y. Nekrich. 2012. Top-kdocument retrieval in optimal time and linear space. In Proc. 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 1066–1078. · doi:10.1137/1.9781611973099.84
[79] G. Navarro and Y. Nekrich. 2013. Optimal top-kdocument retrieval. CoRR arXiv:1307.6789.
[80] G. Navarro, Y. Nekrich, and L. Russo. 2013. Space-efficient data-analysis queries on grids. Theoretical Computer Science 482, 60–72. · Zbl 1291.68155 · doi:10.1016/j.tcs.2012.11.031
[81] G. Navarro, S. Puglisi, and D. Valenzuela. 2011. Practical compressed document retrieval. In Proc. 10th International Symposium on Experimental Algorithms (SEA). LNCS 6630. 193–205.
[82] G. Navarro and S. Thankachan. 2013a. Faster top-kdocument retrieval in optimal space. In Proc. 20th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 8214. 255–262.
[83] G. Navarro and S. Thankachan. 2013b. Top-kdocument retrieval in compact space and near-optimal time. In Proc. 24th Annual International Symposium on Algorithms and Computation (ISAAC). LNCS. To appear. · Zbl 1406.68022
[84] G. Navarro and D. Valenzuela. 2012. Space-efficient top-k document retrieval. In Proc. 11th International Symposium on Experimental Algorithms (SEA). LNCS 7276. 307–319.
[85] M. Patil, S. Thankachan, R. Shah, W.-K. Hon, J. Vitter, and S. Chandrasekaran. 2011. Inverted indexes for phrases and strings. In Proc. 34th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 555–564.
[86] M. Pătraşcu. 2007. Lower bounds for 2-dimensional range counting. In Proc. 39th Annual ACM Symposium on Theory of Computing (STOC). 40–46.
[87] R. Raman, V. Raman, and S. S. Rao. 2007. Succinct indexable dictionaries with applications to encodingk-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3, 4, Article 43. · Zbl 1093.68582 · doi:10.1145/1290672.1290680
[88] R. Raman and S. S. Rao. 2003. Succinct dynamic dictionaries and trees. In Proc. 30th International Colloquium on Automata, Languages and Computation (ICALP). LNCS 2719. 357–368. · Zbl 1039.68043
[89] G. Rao and E. Xun. 2012. Word boundary information and Chinese word segmentation. International Journal on Asian Language Processing 22, 1, 15–32.
[90] K. Sadakane. 2007. Succinct data structures for flexible text retrieval systems. Journal of Discrete Algorithms 5, 12–22. · Zbl 1137.68360 · doi:10.1016/j.jda.2006.03.011
[91] K. Sadakane and G. Navarro. 2010. Fully-functional succinct trees. In Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 134–149. · Zbl 1288.05046 · doi:10.1137/1.9781611973075.13
[92] G. Salton. 1968. Automatic Information Organization and Retrieval. McGraw–Hill.
[93] B. Schieber and U. Vishkin. 1988. On finding lowest common ancestors: Simplification and parallelization. SIAM Journal on Computing 17, 1253–1262. · Zbl 0669.68049 · doi:10.1137/0217079
[94] R. Shah, C. Sheng, S. V. Thankachan, and J. Vitter. 2013. Top-k document retrieval in external memory. In Proc. 21st Annual European Symposium on Algorithms (ESA). 803–814. · Zbl 1394.68129
[95] D. Shasha and P. Bonnet. 2003. Database Tuning Principles, Experiments, and Troubleshooting Techniques. Morgan Kaufmann.
[96] F. Silvestri. 2010. Mining query logs: Turning search usage data into knowledge. Foundations and Trends in Information Retrieval 4, 1–2, 1–174. · Zbl 1187.68072
[97] W. Szpankowski. 1993. A generalized suffix tree and its (un)expected asymptotic behaviors. SIAM Journal on Computiing 22, 6, 1176–1198. · Zbl 0799.68050 · doi:10.1137/0222070
[98] D. Tsur. 2013. Top-k document retrieval in optimal space. Information Processing Letters 113, 12, 440–443. · Zbl 1371.68071 · doi:10.1016/j.ipl.2013.03.012
[99] R. Typke, F. Wiering, and R. Veltkamp. 2005. A survey of music information retrieval systems. In Proc. 6th International Conference on Music Information Retrieval (ISMIR). 153–160.
[100] E. Ukkonen. 1995. On-line construction of suffix trees. Algorithmica 14, 3, 249–260. · Zbl 0831.68027 · doi:10.1007/BF01206331
[101] N. Välimäki and V. Mäkinen. 2007. Space-efficient algorithms for document retrieval. In Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 205–215.
[102] P. van Emde Boas, R. Kaas, and E. Zijlstra. 1977. Design and implementation of an efficient priority queue. Mathematical Systems Theory 10, 99–127. · Zbl 0363.60104 · doi:10.1007/BF01683268
[103] J. Vuillemin. 1980. A unifying look at data structures. Communications of the ACM 23, 4, 229–239. · Zbl 0434.68047 · doi:10.1145/358841.358852
[104] P. Weiner. 1973. Linear pattern matching algorithm. In Proc. 14th Annual IEEE Symposium on Switching and Automata Theory. 1–11. · doi:10.1109/SWAT.1973.13
[105] I. Witten, A. Moffat, and T. Bell. 1999. Managing Gigabytes, 2nd ed. Morgan Kaufmann Publishers. · Zbl 0821.68051
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.