A decision-theoretic approach for segmental classification. (English) Zbl 1454.62206
Summary: This paper is concerned with statistical methods for the segmental classification of linear sequence data, where the task is to segment and classify the data according to an underlying hidden discrete state sequence. Such analysis is commonplace in the empirical sciences, including genomics, finance and speech processing. In particular, we are interested in answering the following question: given data \(y\) and a statistical model \(\pi(x,y)\) of the hidden states \(x\), what should we report as the prediction \(\hat{x}\) under the posterior distribution \(\pi(x|y)\)? That is, how should one predict the underlying states? We demonstrate that traditional approaches, such as reporting the most probable state sequence or the most probable set of marginal predictions, can give undesirable classification artefacts and offer limited control over the properties of the prediction. We propose a decision-theoretic approach using a novel class of Markov loss functions and report \(\hat{x}\) via the principle of minimum expected loss (maximum expected utility). We demonstrate that the sequence of minimum expected loss under the Markov loss function can be enumerated exactly using dynamic programming methods and that it offers flexibility and performance improvements over existing techniques. The result is generic and applicable to any probabilistic model on a sequence, such as hidden Markov models, change point models or product partition models.
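The dynamic programming claim can be illustrated with a minimal sketch. The summary does not define the Markov loss, so the sketch assumes a loss that sums over adjacent pairs, \(L(x,\hat{x}) = \sum_t \ell(x_{t-1}, x_t, \hat{x}_{t-1}, \hat{x}_t)\). Under that assumption the expected loss depends on the posterior only through the pairwise marginals \(\pi(x_{t-1}, x_t | y)\), so a Viterbi-style recursion over \(\hat{x}\) finds the minimiser exactly. The function names, the toy hidden Markov model and the particular loss (pointwise error plus a penalty `lam` for predicting a change where none occurred) are hypothetical choices for illustration, not the paper's own.

```python
import numpy as np


def pairwise_posteriors(init, trans, lik):
    """Forward-backward for an HMM.

    lik[t, k] = p(y_t | x_t = k). Returns xi with
    xi[t, a, b] = P(x_t = a, x_{t+1} = b | y), shape (T-1, K, K).
    """
    T, K = lik.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = init * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    xi = alpha[:-1, :, None] * trans[None] * (lik[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return xi


def min_expected_loss_path(xi, loss, first_cost=None):
    """Minimise the expected pairwise loss by dynamic programming.

    loss[a, b, c, d] is the (assumed) cost of predicting the adjacent
    pair (c, d) when the true pair is (a, b). first_cost[c] is an
    optional expected cost for the first predicted state.
    """
    # cost[t, c, d] = sum_{a,b} xi[t, a, b] * loss[a, b, c, d]
    cost = np.einsum("tab,abcd->tcd", xi, loss)
    n_pairs, K, _ = cost.shape
    best = np.zeros(K) if first_cost is None else np.asarray(first_cost, float)
    back = np.zeros((n_pairs, K), dtype=int)
    for t in range(n_pairs):
        total = best[:, None] + cost[t]   # total[c, d]
        back[t] = total.argmin(axis=0)    # best predecessor c for each d
        best = total.min(axis=0)
    path = [int(best.argmin())]
    for t in range(n_pairs - 1, -1, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(best.min())


def example_loss(K, lam):
    """Hypothetical loss: 0-1 error on the second state of each pair,
    plus lam for predicting a change where the truth has none."""
    loss = np.zeros((K, K, K, K))
    for a, b, c, d in np.ndindex(K, K, K, K):
        loss[a, b, c, d] = float(d != b) + lam * float(c != d and a == b)
    return loss


# Toy two-state HMM; all numbers are illustrative, not from the paper.
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
lik = np.array([[0.8, 0.2], [0.7, 0.3], [0.4, 0.6],
                [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])

xi = pairwise_posteriors(init, trans, lik)
first_cost = 1.0 - xi[0].sum(axis=1)  # expected 0-1 loss at the first position
path, value = min_expected_loss_path(xi, example_loss(2, lam=2.0), first_cost)
print(path, value)
```

With `lam=0.0` this assumed loss reduces to pointwise 0-1 loss and the recursion returns the most probable marginal state at each position. Increasing `lam` discourages predicted change points that the posterior does not support, which is one way to read the summary's remark about control over the properties of the prediction.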

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62C10 Bayesian problems; characterization of Bayes procedures
62M05 Markov processes: estimation; hidden Markov models