×

zbMATH — the first resource for mathematics

XML compression via directed acyclic graphs. (English) Zbl 1352.68079
Summary: Unranked node-labeled trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to their repetitive mark up. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) of minimal dag versus minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags, than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions, and in terms of their asymptotical behavior.

MSC:
68P15 Database theory
05C20 Directed graphs (digraphs), tournaments
68P05 Data structures
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
Software:
TreeRePair; XMill
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Arion, A., Bonifati, A., Manolescu, I., Pugliese, A.: XQueC: A query-conscious compressed XML database. ACM Trans. Intern. Tech. 7(2) (2007) · Zbl 1115.68101
[2] Bakibayev, N; Olteanu, D; Zavodny, J, Fdb: A query engine for factorised relational databases, PVLDB, 5, 1232-1243, (2012)
[3] Bille, P., Landau, G. M., Raman, R., Sadakane, K., Satti, S. R., Weimann, O.: Random Access to Grammar-Compressed Strings. In: SODA, pp. 373-389 (2011) · Zbl 1375.68229
[4] Buneman, P., Grohe, M., Koch, C.: Path Queries on Compressed XML. In: VLDB, pp. 141-152 (2003) · Zbl 0894.68072
[5] Busatto, G; Lohrey, M; Maneth, S, Efficient memory representation of XML document trees, Inf. Syst., 33, 456-474, (2008)
[6] de Bruijn, N.G., Knuth, D.E., Rice, S.O.: Graph theory and computing, pp 15-22. Academic Press, New York (1972)
[7] Dershowitz, N; Zaks, S, Enumerations of ordered trees, Discret. Math., 31, 9-28, (1980) · Zbl 0443.05049
[8] Downey, PJ; Sethi, R; Tarjan, RE, Variations on the common subexpression problem, J. ACM, 27, 758-771, (1980) · Zbl 0458.68026
[9] Ershov, AP, On programming of arithmetic operations, Commun. ACM, 1, 3-9, (1958) · Zbl 0086.33203
[10] Flajolet, P; Odlyzko, A, The average height of binary trees and other simple trees, J. Comput. Syst. Sci., 25, 171-213, (1982) · Zbl 0499.68027
[11] Flajolet, P; Odlyzko, A, Singularity analysis of generating functions, SIAM J. Discret. Math., 3, 216-240, (1990) · Zbl 0712.05004
[12] Flajolet, P., Sedgewick, R.: Analytic Combinatorics. Cambridge University Press (2009) · Zbl 1165.05001
[13] Flajolet, P., Sipala, P., Steyaert, J.-M.: Analytic Variations on the Common Subexpression Problem. In: ICALP, pp. 220-234 (1990) · Zbl 0765.68048
[14] Knuth, D.E.: The Art of Computer Programming, Vol. I: Fundamental Algorithms. Addison-Wesley (1968) · Zbl 0191.17903
[15] Koch, C.: Efficient processing of expressive node-selecting queries on XML data in secondary storage: A tree automata-based approach. In: VLDB, pp. 249-260 (2003)
[16] Larsson, N. J., Moffat, A.: Offline Dictionary-Based Compression. In: DCC, pp. 296-305 (1999)
[17] Liefke, H., XMILL, D. Suciu.: An Efficient Compressor for XML Data. In: SIGMOD Conference, pp. 153-164 (2000)
[18] Lohrey, M, Algorithmics on SLP-compressed strings: A survey, Groups Complexity Cryptol., 4, 241-299, (2013) · Zbl 1285.68088
[19] Lohrey, M; Maneth, S, The complexity of tree automata and xpath on grammar-compressed trees, Theor. Comput. Sci., 363, 196-210, (2006) · Zbl 1153.68402
[20] Lohrey, M; Maneth, S; Mennicke, R, XML tree structure compression using repair, Inf. Syst., 38, 1150-1167, (2013)
[21] Lohrey, M., Maneth, S., Noeth, E.: XML Compression via Dags. In: ICDT, pp. 69-80 (2013)
[22] Lohrey, M; Maneth, S; Schmidt-Schauß, M, Parameter reduction and automata evaluation for grammar-compressed trees, J. Comput. Syst. Sci., 78, 1651-1669, (2012) · Zbl 1246.68114
[23] Maneth, S., Sebastian, T.: Fast and tiny structural self-indexes for XML. CoRR arXiv: abs/1012.5696 (2010)
[24] Marckert, J-F, The rotation correspondence is asymptotically a dilatation, Random Struct. Algorithm., 24, 118-132, (2004) · Zbl 1034.05016
[25] Meinel, C., Theobald, T.: Algorithms and Data Structures in VLSI Design: OBDD - Foundations and Applications. Springer (1998) · Zbl 0899.68040
[26] Neven, F, Automata theory for XML researchers, SIGMOD Rec., 31, 39-46, (2002)
[27] Nevill-Manning, CG; Witten, IH, Identifying hierarchical strcture in sequences: A linear-time algorithm, J. Artif. Intell. Res. (JAIR), 7, 67-82, (1997) · Zbl 0894.68072
[28] Plandowski, W.: Testing equivalence of morphisms on context-free languages. In: ESA, pp. 460-470 (1994) · Zbl 1246.68114
[29] Schwentick, T, Automata for XML - a survey, J. Comput. Syst. Sci., 73, 289-315, (2007) · Zbl 1115.68101
[30] Suciu, D.: Typechecking for semistructured data. In: DBPL, pp. 1-20 (2001) · Zbl 1098.68555
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.