×

A model selection approach for multiple sequence segmentation and dimensionality reduction. (English) Zbl 1403.62044

Summary: In this paper we consider the problem of segmenting \(n\) aligned random sequences of equal length \(m\) into a finite number of independent blocks. We propose a penalized maximum likelihood criterion to infer simultaneously the number of points of independence as well as the position of each point. We show how to compute exactly the estimator by means of a dynamic programming algorithm with time complexity \(O(m^2 n)\). We also propose another method, called hierarchical algorithm, that provides an approximation to the estimator when the sample size increases and runs in time \(O \{m \ln(m) n \}\). Our main theoretical results are the strong consistency of both estimators when the sample size \(n\) grows to infinity. We illustrate the convergence of these algorithms through some simulation examples and we apply the method to identify recombination hotspots in real SNPs data.

MSC:

62G05 Nonparametric estimation
62G20 Asymptotic properties of nonparametric inference
62H12 Estimation in multivariate analysis
92D20 Protein sequences, DNA sequences

Software:

PLINK
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Algama, M.; Keith, J. M., Investigating genomic structure using : A Bayesian segmentation model, Comput. Struct. Biotechnol. J., 10, 107-115, (2014)
[2] Bai, J.; Perron, P., Computation and analysis of multiple structural change models, J. Appl. Econometrics, 18, 1-22, (2003)
[3] Boys, R. J.; Henderson, D. A., A Bayesian approach to DNA sequence segmentation, Biometrics, 60, 573-581, (2004) · Zbl 1274.62728
[4] Breiman, L., Probability, (1992), SIAM Philadelphia, PA · Zbl 0753.60001
[5] Browning, S. R.; Browning, B. L., Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering, Am. J. Hum. Genet., 81, 1084-1097, (2007)
[6] Chang, C. C.; Chow, C. C.; Tellier, L. C.; Vattikuti, S.; Purcell, S. M.; Lee, J. J., Second-generation PLINK: rising to the challenge of larger and richer datasets, GigaScience, 4, 7, (2015)
[7] Csiszar, I.; Talata, Z., Context tree estimation for not necessarily finite memory processes, via BIC and MDL, IEEE Trans. Inform. Theory, 52, 1007-1016, (2006) · Zbl 1284.94027
[8] Deng, S.; Shi, Y.; Yuan, L.; Li, Y.; Ding, G., Detecting the borders between coding and non-coding DNA regions in prokaryotes based on recursive segmentation and nucleotide doublets statistics, BMC Genomics, 13, S19, (2012)
[9] Finkelstein, A.; Roytberg, M., Computation of biopolymers: A general approach to different problems, Biosystems, 30, 1-19, (1993)
[10] Fridlyand, J.; Snijders, A. M.; Pinkel, D.; Albertson, D. G.; Jain, A. N., Hidden Markov models approach to the analysis of array CGH data, J. Multivariate Anal., 90, 132-153, (2004) · Zbl 1047.92026
[11] Galves, A.; Galves, C.; García, J. E.; Garcia, N. L.; Leonardi, F., Context tree selection and linguistic rhythm retrieval from written texts, ann, Appl. Stat., 6, 186-209, (2012) · Zbl 1235.68305
[12] Gwadera, R.; Gionis, A.; Mannila, H., Optimal segmentation using tree models, Knowl. Inf. Syst., 15, 259-283, (2008)
[13] Haiminen, N.; Mannila, H., Discovering isochores by least-squares optimal segmentation, Gen, 394, 53-60, (2007)
[14] A second generation human haplotype map of over 3.1 million SNPs, Nature, 449, 851-861, (2007)
[15] Hawkins, D. M., Point estimation of the parameters of piecewise regression models, J. Roy. Statist. Soc. Ser. C Appl. Statist., 25, 51-57, (1976)
[16] Hoggart, C. J.; Whittaker, J. C.; De Iorio, M.; Balding, D. J., Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies, PLoS Genet., 4, 1-8, (2008)
[17] Li, J. Z.; Absher, D. M.; Tang, H.; Southwick, A. M.; Casto, A. M.; Ramachandran, S.; Cann, H. M.; Barsh, G. S.; Feldman, M.; Cavalli-Sforza, L. L.; Myers, R. M., Worldwide human relationships inferred from genome-wide patterns of variation, Science, 319, 1100-1104, (2008)
[18] Li, W.; Bernaola-Galván, P.; Haghighi, F.; Grosse, I., Applications of recursive segmentation to the analysis of DNA sequences, Comput. Chem., 26, 491-510, (2002)
[19] Pääbo, S., The mosaic that is our genome, Nature, 421, 409, (2003)
[20] Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M. A.; Bender, D.; Maller, J.; Sklar, P.; de Bakker, P. I.; Daly, M. J.; Sham, P. C., PLINK: A tool set for whole-genome association and population-based linkage analyses, Am. J. Hum. Genet., 81, 559-575, (2007)
[21] Ramensky, V. E.; Makeev, V. J.; Roytberg, M. A.; Tumanyan, V. G., DNA segmentation through the Bayesian approach, J. Comput. Biol., 7, 215-231, (2000)
[22] Tajima, F., Determination of window size for analyzing DNA sequences, J. Mol. Evol., 33, 470-473, (1991)
[23] Wen, S.-Y.; Zhang, C.-T., Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis, Biochem. Biophys. Res. Commun., 311, 215-222, (2003)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.