Discovery of frequent tag tree patterns in semistructured web documents. (English) Zbl 1048.68848

Chen, Ming-Syan (ed.) et al., Advances in knowledge discovery and data mining. 6th Pacific-Asia conference, PAKDD 2002, Taipei, Taiwan, May 6–8, 2002. Proceedings. Berlin: Springer (ISBN 3-540-43704-5). Lect. Notes Comput. Sci. 2336, 341-355 (2002).
Summary: Many Web documents, such as HTML and XML files, have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree-structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge-labeled tree with ordered children that has structured variables. An edge label is a tag or a keyword in such Web documents, and a variable can be replaced by an arbitrary tree, so a tag tree pattern is well suited to representing tree-structured patterns in such documents. First, we show that it is hard to compute the optimum frequent tag tree pattern. We therefore present an algorithm for generating all maximally frequent tag tree patterns and prove its correctness. Finally, we report experimental results for the algorithm. Although the algorithm is not efficient, the experiments show that it can extract characteristic tree-structured patterns from such data.
For the entire collection see [Zbl 0992.68521].
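
The notion of a tag tree pattern matching a document can be illustrated with a minimal sketch. It is not the algorithm of the paper: it assumes a simplified model in which a tree is an ordered list of (edge label, subtree) pairs and a variable stands for exactly one edge together with whatever subtree hangs below it. The names `VAR`, `matches` and `frequency`, and the sample documents, are hypothetical and chosen only for illustration.

```python
# Simplified illustration only; the paper's structured variables are more general.
VAR = "?x"  # hypothetical marker for a variable edge


def matches(pattern, tree):
    """Return True if the ordered, edge-labeled tree fits the pattern.

    Both arguments are lists of (edge_label, child) pairs in child order.
    Assumption: a variable edge matches one edge with any label and any subtree.
    """
    if len(pattern) != len(tree):
        return False
    for (p_label, p_child), (t_label, t_child) in zip(pattern, tree):
        if p_label == VAR:
            continue  # variable: accept this edge and everything below it
        if p_label != t_label or not matches(p_child, t_child):
            return False
    return True


def frequency(pattern, documents):
    """Fraction of documents that the pattern matches."""
    return sum(matches(pattern, d) for d in documents) / len(documents)


# Three toy documents, written as trees of tags.
doc1 = [("html", [("head", []), ("body", [("p", [])])])]
doc2 = [("html", [("head", [("title", [])]), ("body", [("p", [])])])]
doc3 = [("html", [("body", [])])]

# Pattern: html with two children, the first arbitrary, the second body/p.
pattern = [("html", [(VAR, []), ("body", [("p", [])])])]

print(frequency(pattern, [doc1, doc2, doc3]))
```

Under these assumptions the pattern matches the first two documents and rejects the third, which has only one child below `html`. A pattern would count as frequent when this fraction reaches a user-chosen threshold.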

MSC:

68U99 Computing methodologies and applications
68P15 Database theory
68P20 Information storage and retrieval of data

Software:

RSBR_