zbMATH — the first resource for mathematics

A sticky HDP-HMM with application to speaker diarization. (English) Zbl 1232.62077
Summary: We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Y.W. Teh et al. [J. Am. Stat. Assoc. 101, No. 476, 1566–1581 (2006; Zbl 1171.62349)]. Although the basic HDP-HMM tends to over-segment the audio data, creating redundant states and rapidly switching among them, we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
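
The summary does not give the construction itself. As an illustration only, the following minimal sketch uses the widely used finite-Dirichlet ("weak limit") truncation of an HDP-HMM with an added self-transition bias. The function name and the parameter names `L`, `gamma`, `alpha` and `kappa` are assumptions made for this example and are not taken from the record above. Setting `kappa` to zero gives the unbiased truncated prior; larger values put more prior mass on staying in the current state, which lowers the switching rate.

```python
import numpy as np

def sample_sticky_transitions(L, gamma, alpha, kappa, rng):
    """Draw global weights and an L-by-L transition matrix from a
    truncated, self-transition-biased prior (illustrative sketch)."""
    # Global state weights: finite Dirichlet approximation to the DP.
    beta = rng.dirichlet(np.full(L, gamma / L))
    pi = np.empty((L, L))
    for j in range(L):
        # Row j is centred on beta; a small floor keeps every
        # concentration strictly positive for the Dirichlet sampler.
        conc = np.maximum(alpha * beta, 1e-12)
        # Extra mass kappa on the self-transition (j -> j).
        conc[j] += kappa
        pi[j] = rng.dirichlet(conc)
    return beta, pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for kappa in (0.0, 50.0):
        _, pi = sample_sticky_transitions(10, 5.0, 1.0, kappa, rng)
        print(f"kappa={kappa}: mean self-transition = {np.diag(pi).mean():.3f}")
```

Because every row shares the same global weights `beta`, all states draw on a common pool of successor states, while `kappa` biases only the diagonal of the transition matrix.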

MSC:
62G99 Nonparametric inference
62F15 Bayesian inference
62P99 Applications of statistics
62M99 Inference from stochastic processes
62L12 Sequential estimation
65C60 Computational problems in statistics (MSC2010)
References:
[1] Barras, C., Zhu, X., Meignier, S. and Gauvain, J.-L. (2004). Improving speaker diarization. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
[2] Beal, M. J. and Krishnamurthy, P. (2006). Gene expression time course clustering with countably infinite hidden Markov models. In Proc. Conference on Uncertainty in Artificial Intelligence, Cambridge, MA.
[3] Beal, M. J., Ghahramani, Z. and Rasmussen, C. E. (2002). The infinite hidden Markov model. In Advances in Neural Information Processing Systems 14 577-584. MIT Press, Cambridge, MA.
[4] Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist. 1 353-355. · Zbl 0276.62010
[5] Chen, S. S. and Gopalakrishnan, P. S. (1998). Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In Proc. DARPA Broadcast News Transcription and Understanding Workshop 127-132. Morgan Kaufmann, San Francisco, CA.
[6] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209-230. · Zbl 0255.62037
[7] Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2008). An HDP-HMM for systems with state persistence. In Proc. International Conference on Machine Learning, Helsinki, Finland, July 2008.
[8] Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2009). Nonparametric Bayesian learning of switching dynamical systems. In Advances in Neural Information Processing Systems 21 457-464.
[9] Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2010). Supplement to “A sticky HDP-HMM with application to speaker diarization.” DOI: . · Zbl 1232.62077
[10] Gales, M. and Young, S. (2007). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing 1 195-304. · Zbl 1145.68045
[11] Gauvain, J.-L., Lamel, L. and Adda, G. (1998). Partitioning and transcription of broadcast news data. In Proc. International Conference on Spoken Language Processing, Sydney, Australia 1335-1338.
[12] Hoffman, M., Cook, P. and Blei, D. (2008). Data-driven recomposition using the hierarchical Dirichlet process hidden Markov model. In Proc. International Computer Music Conference, Belfast, UK.
[13] Ishwaran, H. and Zarepour, M. (2000a). Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika 87 371-390. · Zbl 0949.62037
[14] Ishwaran, H. and Zarepour, M. (2002b). Dirichlet prior sieves in finite normal mixtures. Statist. Sinica 12 941-963. · Zbl 1002.62028
[15] Ishwaran, H. and Zarepour, M. (2002c). Exact and approximate sum-representations for the Dirichlet process. Canad. J. Statist. 30 269-283. · Zbl 1035.60048
[16] Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Statist. 13 158-182.
[17] Jasra, A., Holmes, C. C. and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. 20 50-67. · Zbl 1100.62032
[18] Johnson, M. (2007). Why doesn’t EM find good HMM POS-taggers? In Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic.
[19] Kivinen, J. J., Sudderth, E. B. and Jordan, M. I. (2007). Learning multiscale representations of natural scenes using Dirichlet processes. In Proc. International Conference on Computer Vision, Rio de Janeiro, Brazil 1-8.
[20] Kurihara, K., Welling, M. and Teh, Y. W. (2007). Collapsed variational Dirichlet process mixture models. In Proc. International Joint Conferences on Artificial Intelligence, Hyderabad, India.
[21] Meignier, S., Bonastre, J.-F., Fredouille, C. and Merlin, T. (2000). Evolutive HMM for multi-speaker tracking system. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.
[22] Meignier, S., Bonastre, J.-F. and Igounet, S. (2001). E-HMM approach for learning and adapting sound models for speaker indexing. In Proc. Odyssey Speaker Language Recognition Workshop, June 2001.
[23] Munkres, J. (1957). Algorithms for the assignment and transportation problems. J. Soc. Industr. Appl. Math. 5 32-38. · Zbl 0131.36604
[24] NIST. Rich transcriptions database. Available at , 2007.
[25] Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika 95 169-186. · Zbl 1437.62576
[26] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 257-286.
[27] Reynolds, D. A. and Torres-Carrasquillo, P. A. (2004). The MIT Lincoln Laboratory RT-04F diarization systems: Applications to broadcast news and telephone conversations. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
[28] Robert, C. P. (2007). The Bayesian Choice. Springer, New York. · Zbl 1129.62003
[29] Rodriguez, A., Dunson, D. B. and Gelfand, A. E. (2008). The nested Dirichlet process. J. Amer. Statist. Assoc. 103 1131-1154. · Zbl 1205.62062
[30] Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. J. Amer. Statist. Assoc. 97 337-351. · Zbl 1073.65503
[31] Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4 639-650. · Zbl 0823.62007
[32] Siegler, M., Jain, U., Raj, B. and Stern, R. M. (1997). Automatic segmentation, classification and clustering of broadcast news audio. In Proc. DARPA Speech Recognition Workshop 97-99. Morgan Kaufmann, San Francisco, CA.
[33] Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Amer. Statist. Assoc. 101 1566-1581. · Zbl 1171.62349
[34] Tranter, S. E. and Reynolds, D. A. (2006). An overview of automatic speaker diarization systems. IEEE Trans. Audio, Speech Language Process. 14 1557-1565.
[35] Van Gael, J., Saatci, Y., Teh, Y. W. and Ghahramani, Z. (2008). Beam sampling for the infinite hidden Markov model. In Proc. International Conference on Machine Learning, Helsinki, Finland, July 2008.
[36] Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Commun. Statist. Simul. Comput. 36 45-54. · Zbl 1113.62058
[37] Wooters, C. and Huijbregts, M. (2007). The ICSI RT07s speaker diarization system. Lecture Notes in Computer Science 4625 509-519.
[38] Wooters, C., Fung, J., Peskin, B. and Anguera, X. (2004). Towards robust speaker segmentation: The ICSI-SRI Fall 2004 diarization system. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
[39] Xing, E. P. and Sohn, K.-A. (2007). Hidden Markov Dirichlet process: Modeling genetic inference in open ancestral space. Bayesian Anal. 2 501-528. · Zbl 1332.62352
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.