zbMATH — the first resource for mathematics

Towards a framework for near-duplicate detection in document collections based on closed sets of attributes. (English) Zbl 1190.68021
Summary: Around 30% of the documents on the web have duplicates. Near-duplicate documents bear high similarity to each other, yet they are not bitwise identical: their content is essentially the same, but they differ in a small portion of the document. Algorithms for detecting such pages are therefore needed. In the course of developing a near-duplicate detection system, we present an approach based on frequent closed sets of attributes for constructing clusters of duplicate documents, with documents represented by both syntactic and lexical methods.
We provide a prototype of a software environment for those who want to use such methods to find near-duplicate documents in large text collections. The software includes two syntactic methods for finding near-duplicate documents, a clustering technique based on frequent closed itemsets, means of evaluating the results, and a tool for generating test collections of near-duplicate documents.
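The clustering idea in the summary can be illustrated with a minimal Python sketch. It is not the authors' prototype. It assumes word-level shingles as the syntactic representation (the summary does not say which syntactic methods are used) and enumerates closed sets of shingles by brute-force intersection, which is practical only for tiny collections. The names `shingles`, `closed_clusters`, `min_support`, and `min_size` are hypothetical.

```python
def shingles(text, k=3):
    """Represent a document as the set of its k-word shingles (an assumed representation)."""
    words = text.lower().split()
    return frozenset(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))


def closed_clusters(docs, k=3, min_support=2, min_size=1):
    """Map each group of documents to the closed set of shingles they all share.

    A set of shingles is closed if it equals the intersection of all documents
    that contain it. Closed sets are enumerated naively by intersecting
    document representations until no new set appears.
    """
    rep = {name: shingles(text, k) for name, text in docs.items()}
    closed = set(rep.values())
    frontier = set(closed)
    while frontier:
        new = set()
        for candidate in frontier:
            for doc_shingles in rep.values():
                common = candidate & doc_shingles
                if common and common not in closed:
                    new.add(common)
        closed |= new
        frontier = new

    clusters = {}
    for intent in closed:
        # The cluster is every document that contains all shingles of the closed set.
        extent = frozenset(n for n, s in rep.items() if intent <= s)
        # "Frequent" here means shared by at least min_support documents.
        if len(extent) >= min_support and len(intent) >= min_size:
            clusters[extent] = intent
    return clusters


example_docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the sleepy dog",
    "c": "completely unrelated text about formal concept analysis",
}
for cluster, shared in closed_clusters(example_docs).items():
    print(sorted(cluster), len(shared))
```

In this toy collection, documents "a" and "b" form a cluster because they share several shingles, while "c" shares none with the others. Raising `min_size` would require more shared shingles before two documents count as near-duplicates.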
68P10 Searching and sorting
68M11 Internet topics