Comparison of alignment free string distances for complete genome phylogeny. (English) Zbl 1284.92073

Summary: In this paper, we compare the accuracy of four string distances on complete genomes to reconstruct phylogenies using simulated and real biological data. These distances are based on common words shared by raw genomic sequences and do not require preliminary processing steps such as gene identification or sequence alignment. Moreover, they are computable in linear time. The first distance is based on maximum significant matches (MSM). The second is computed from the frequencies of all the words of length \(k\) (KW). The third distance is based on the average length of maximum common substrings at any position (ACS). The last one is based on the Ziv-Lempel compression algorithm (ZL). We describe a simulation process of evolution to generate a set of sequences having evolved according to a random tree topology \(T\). This process allows both base substitution and fragment insertion/deletion, including horizontal transfers. The distances between the generated sequences are computed using the four formulas and the corresponding trees \(T'\) are reconstructed using Neighbor-Joining. \(T\) and \(T'\) are compared according to topological criteria. These comparisons show that the MSM distance outperforms the others whatever the parameters used to generate sequences. Finally, we test the MSM and KW distances on real biological data (i.e. prokaryotic complete genomes) and we compare the NJ trees to a Maximum Likelihood 16S + 23S RNA tree. We show that the MSM distance provides accurate results to study intra-phylum relationships, much better than those given by KW.


92D20 Protein sequences, DNA sequences
05C05 Trees
68R15 Combinatorics on words
90C27 Combinatorial optimization
92B10 Taxonomy, cladistics, statistics in mathematical biology


Full Text: DOI


[1] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410
[2] Amir A, Keselman D (1997) Maximum agreement subtree in a set of evolutionary trees: metric and efficient algorithms. SIAM J Comput 26: 1656–1669 · Zbl 0885.68071
[3] Deschavanne PJ, Giron A (1999) Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol 16(10): 1391–1399
[4] Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity . BMC Bioinformatics 5: 113
[5] Estabrook GF, McMorris FR, Meacham CA (1985) Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool 34: 193–200
[6] Guindon S, Gascuel O (2003) A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704
[7] Guyon F, Guénoche A (2009) An evolutionary distance based on maximal unique matches. Commun Stat (in press) · Zbl 1188.92027
[8] Hao BI, Qi J, Wang B (2003) Prokaryotic phylogeny based on complete genomes without sequence alignment. Modern Phys Lett B 17(2): 1–4 · Zbl 1114.92309
[9] Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, Schuster SC (2005) Whole-genome prokaryotic phylogeny. Bioinformatics 15;21(10): 2329–2335
[10] Jeffrey HJ (1990) Chaos game representation of gene structure. Nucleic Acids Res 18(8): 2163–2170
[11] Karlin S, Taylor H (1981) A second course in stochastic processes. Academic Press, New York · Zbl 0469.60001
[12] Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16: 111–120
[13] Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL (2004) Versatile and open software for comparing large genomes. Genome Biol 5: R12
[14] Otu HH, Sayood K (2003) A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19(16): 2122–2130
[15] Robinson DF, Foulds LR (1981) Comparison of phylogenetic trees. Math Biosci 53: 131–147 · Zbl 0451.92006
[16] Saitou N, Nei M (1987) The Neighbor-Joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4: 406–425
[17] Snel B, Huynen MA, Dutilh BE (2005) Genome trees and the nature of genome evolution. Annu Rev Microbiol 59: 191–209
[18] Ulitsky I, Burnstein D, Tuller T, Chor B (2006) The average common substring approach to phylogenomic reconstruction. J Comput Biol 13: 336–350
[19] Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inform Theory 23: 337–343 · Zbl 0379.94010
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.