# zbMATH — the first resource for mathematics

Optimal algorithms for comparing trees with labeled leaves. (English) Zbl 0589.62044
Summary: Let $$R_ n$$ denote the set of rooted trees with n leaves in which: the leaves are labeled by the integer in $$\{$$ 1,...,n$$\}$$ ; and among interior vertices only the root may have degree two. Associated with each interior vertex v in such a tree is the subset, or cluster, of leaf labels in the subtree rooted at v. Cluster $$\{$$ 1,...,n$$\}$$ is called trivial.
Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For any k trees in $$R_ n$$, the strict consensus tree $$C(T_ 1,...,T_ k)$$ is that tree in $$R_ n$$ containing exactly those clusters common to every one of the k trees. Similarity between trees $$T_ 1$$ and $$T_ 2$$ in $$R_ n$$ is measured by the number $$S(T_ 1,T_ 2)$$ of nontrivial clusters in both $$T_ 1$$ and $$T_ 2$$; dissimilarity, by the number $$D(T_ 1,T_ 2)$$ of clusters in $$T_ 1$$ or $$T_ 2$$ but not in both. Algorithms are known to compute $$C(T_ 1,...,T_ k)$$ in $$O(kn^ 2)$$ time, and $$S(T_ 1,T_ 2)$$ and $$D(T_ 1,T_ 2)$$ in $$O(n^ 2)$$ time.
I propose a special representation of the clusters of any tree T in $$R_ n$$, one that permits testing in constant time whether a given cluster exists in T. I describe algorithms that exploit this representation to compute $$C(T_ 1,...,T_ k)$$ in O(kn) time, and $$S(T_ 1,T_ 2)$$ and $$D(T_ 1,T_ 2)$$ in O(n) time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed in O(n) time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves.

##### MSC:
 62H30 Classification and discrimination; cluster analysis (statistical aspects) 68Q25 Analysis of algorithms and problem complexity 62-04 Software, source code, etc. for problems pertaining to statistics
Full Text: