GPU-accelerated Gibbs sampling: a case study of the horseshoe probit model. (English) Zbl 1430.62172

Summary: Gibbs sampling is a widely used Markov chain Monte Carlo (MCMC) method for numerically approximating integrals of interest in Bayesian statistics and other mathematical sciences. Many implementations of MCMC methods do not extend easily to parallel computing environments, as their inherently sequential nature incurs a large synchronization cost. In the case study illustrated by this paper, we show how to do Gibbs sampling in a fully data-parallel manner on a graphics processing unit, for a large class of exchangeable models that admit latent variable representations. Our approach takes a systems perspective, with emphasis placed on efficient use of compute hardware. We demonstrate our method on a Horseshoe Probit regression model and find that our implementation scales effectively to thousands of predictors and millions of data points simultaneously.


62J12 Generalized linear models (logistic models)
62R07 Statistical aspects of big data and data science
62A09 Graphical methods in statistics
65C05 Monte Carlo methods
Full Text: DOI arXiv


[1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems. Tech. Rep., Google (2017)
[2] Albert, JH; Chib, S., Bayesian analysis of binary and polychotomous response data, J. Am. Stat. Assoc., 88, 669-679, (1993) · Zbl 0774.62031
[3] Beam, AL; Ghosh, SK; Doyle, J., Fast Hamiltonian Monte Carlo using GPU computing, J. Comput. Graph. Stat., 25, 136-548, (2015)
[4] Blei, DM; Ng, AY; Jordan, MI, Latent Dirichlet allocation, J. Mach. Learn. Res., 3, 993-1022, (2003) · Zbl 1112.68379
[5] Breyer, L.; Roberts, GO; Rosenthal, JS, A note on geometric ergodicity and floating-point roundoff error, Stat. Prob. Lett., 53, 123-127, (2001) · Zbl 0991.60064
[6] Canny, J., Zhao, H.: Bidmach: large-scale learning with zero memory allocation. In: NIPS workshop on BigLearning (2013)
[7] Carvalho, CM; Polson, NG; Scott, JG, The horseshoe estimator for sparse signals, Biometrika, 97, 1-26, (2010) · Zbl 1406.62021
[8] Chamandy, N.; Muralidharan, O.; Wager, S., Teaching statistics at Google-scale, Am. Stat., 69, 283-291, (2015)
[9] Cheng, R., The generation of gamma variables with non-integral shape parameter, J. R. Stat. Soc. Ser. C (Appl. Stat.), 26, 71-75, (1977)
[10] Chopin, N., Ridgway, J.: Leave PIMA Indians alone: binary regression as a benchmark for Bayesian computation (2015). arXiv:1506.08640 · Zbl 1442.62007
[11] Finetti, B., La prévision: ses lois logiques, ses sources subjectives, Ann. l’inst. Henri Poincaré, 7, 1-68, (1937) · JFM 63.1070.02
[12] Geman, S.; Geman, D., Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6, 721-741, (1984) · Zbl 0573.62030
[13] Hastings, WK, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57, 97-109, (1970) · Zbl 0219.65008
[14] Hoffman, MD; Gelman, A., The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo, J. Mach. Learn. Res., 15, 1593-1623, (2014) · Zbl 1319.60150
[15] Krizhevsky, A.: One weird trick for parallelizing convolutional neural networks (2014). arXiv:1404.5997
[16] Langford, J.: Vowpal Wabbit open source project. Tech. Rep., Yahoo (2007)
[17] L’Ecuyer, P.; Simard, R., TestU01: A C Library for empirical testing of random number generators, ACM Trans. Math. Softw., 33, 22, (2007) · Zbl 1365.65008
[18] Lee, A.; Yau, C.; Giles, MB; Doucet, A.; Holmes, CC, On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods, J. Comput. Graph. Stat., 19, 769-789, (2010)
[19] Magnusson, M., Jonsson, L., Villani, M., Broman, D.: Sparse partially collapsed MCMC for parallel inference in topic models (2015). arXiv:1506.03784
[20] Makalic, E., Schmidt, D.F.: A simple sampler for the horseshoe estimator (2015). arXiv:1508.03884
[21] Manssen, M.; Weigel, M.; Hartmann, AK, Random number generators for massively parallel simulations on GPU, Eur. Phys. J. Spec. Topics, 210, 53-71, (2012)
[22] Marsaglia, G., XORshift RNGs, J. Stat. Softw., 8, 1-6, (2003)
[23] Matloff, N.: Programming on Parallel Machines. University of California, Davis (2011)
[24] Matsumoto, M.; Nishimura, T., Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Trans. Model. Comput. Simul., 8, 3-30, (1998) · Zbl 0917.65005
[25] Metropolis, N.; Rosenbluth, AW; Rosenbluth, MN; Teller, AH; Teller, E., Equation of state calculations by fast computing machines, J. Chem. Phys., 21, 1087-1092, (1953)
[26] Mingas, G.; Bouganis, CS, Parallel tempering MCMC acceleration using reconfigurable hardware, 227-238, (2012), Berlin
[27] Murray, LM; Lee, A.; Jacob, PE, Parallel resampling in the particle filter, J. Comput. Graph. Stat., 25, 789-805, (2016)
[28] Neubig, G., Goldberg, Y., Dyer, C.: On-the-fly operation batching in dynamic computation graphs. In: 31st Conference on Neural Information Processing Systems (2017)
[29] Nickolls, J.; Buck, I.; Garland, M.; Skadron, K., Scalable parallel programming with CUDA, ACM Queue, 6, 40-53, (2008)
[30] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. In: Workshop on Automatic Differentiation, 31st Conference on Neural Information Processing Systems (2017)
[31] Robert, CP, Simulation of truncated normal variables, Stat. Comput., 5, 121-125, (1995)
[32] Salmon, J.K., Moraes, M.A., Dror, R.O., Shaw, D.E.: Parallel random numbers: as easy as 1, 2, 3. In: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12 (2011)
[33] Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, Berlin (2010)
[34] Seita, D., Chen, H., Canny, J.: Fast parallel SAME Gibbs sampling on general discrete Bayesian networks (2015). arXiv:1511.06416
[35] Stone, JE; Gohara, D.; Shi, G., OpenCL: a parallel programming standard for heterogeneous computing systems, Comput. Sci. Eng., 12, 66-73, (2010)
[36] Suchard, MA; Wang, Q.; Chan, C.; Frelinger, J.; Cron, A.; West, M., Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures, J. Comput. Graph. Stat., 19, 419-438, (2010)
[37] Terenin, A., Simpson, D., Draper, D.: Asynchronous Gibbs sampling (2016). arXiv:1509.08999
[38] Terenin, A., Magnusson, M., Jonsson, L., Draper, D.: Pólya urn latent Dirichlet allocation: a doubly sparse massively parallel sampler (2017). arXiv:1704.03581
[39] Typesafe Inc (2015) Akka Scala documentation. http://akka.io/
[40] Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S-PLUS. Springer, Berlin (2002) · Zbl 1006.62003
[41] Yan, F., Xu, N., Qi, Y.: Parallel inference for latent Dirichlet allocation on graphics processing units. In: Advances in Neural Information Processing Systems, vol. 20, pp. 2134-2142 (2009)
[42] Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, vol. 10, pp. 10-16 (2010)
[43] Zhan, Y., Kindratenko, V.: Evaluating AMD HIP toolset for migrating CUDA applications to AMD GPUS. Tech. Rep., National Center for Supercomputing Applications (2016)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.