Poisson approximation for the number of repeats in a stationary Markov chain. (English) Zbl 1140.62013

Summary: Detecting repeated sequences within complete genomes is a powerful tool for understanding genome dynamics and the evolutionary history of species. To distinguish significant repeats from those that can arise by chance alone, statistical methods have to be developed. We show that the distribution of the number of long repeats in long sequences generated by stationary Markov chains can be approximated by a Poisson distribution with an explicit parameter. Using the Chen-Stein method, we provide a bound for the approximation error; this bound converges to 0 as the length \(n\) of the sequence tends to \(\infty\), provided that the length \(t\) of the repeats satisfies \(n^2\rho^t = O(1)\) for some \(0 < \rho < 1\). With this Poisson approximation, \(p\)-values can then be calculated easily to determine whether a given genome is significantly enriched in repeats of length \(t\).
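The \(p\)-value step described in the summary can be sketched as follows. This is a minimal illustration, not the authors' code: the summary does not state the explicit Poisson parameter, so `lam` is a hypothetical input that would have to be computed from the paper's formula, and the function names are chosen here for illustration only. The second helper solves \(n^2\rho^t = 1\) for \(t\), which indicates the order of repeat length at which the condition \(n^2\rho^t = O(1)\) starts to hold.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(N >= k) for N ~ Poisson(lam).

    `lam` is an assumed input: the explicit parameter from the paper
    is not given in the summary and must be supplied by the caller.
    """
    if k <= 0:
        return 1.0
    # Accumulate P(N = i) for i = 0, ..., k - 1, updating each term
    # from the previous one to avoid computing large factorials.
    term = math.exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    # Clamp at 0 to guard against rounding when the tail is tiny.
    return max(0.0, 1.0 - cdf)


def critical_repeat_length(n: int, rho: float) -> float:
    """Return the real t solving n^2 * rho^t = 1, i.e. 2 ln(n) / ln(1/rho)."""
    if n < 1 or not (0.0 < rho < 1.0):
        raise ValueError("need n >= 1 and 0 < rho < 1")
    return 2.0 * math.log(n) / math.log(1.0 / rho)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    observed_repeats = 7
    lam = 2.5
    print("p-value:", poisson_upper_tail(observed_repeats, lam))
    print("critical t:", critical_repeat_length(10**6, 0.3))
```

Computing the tail as one minus the lower sum is simple but loses precision for very small \(p\)-values; summing the upper tail directly would be preferable in that regime.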

MSC:

62E17 Approximations to statistical distributions (nonasymptotic)
62P10 Applications of statistics to biology and medical sciences; meta analysis
62M99 Inference from stochastic processes
92C40 Biochemistry, molecular biology
60C05 Combinatorial probability

Software:

REPuter
