SIMD parallel MCMC sampling with applications for big-data Bayesian analytics. (English) Zbl 1468.62131

Summary: Computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing applications for big data analytics, necessitate both the identification of potential opportunities to parallelize techniques such as Monte Carlo Markov Chain (MCMC) sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to elicit the performance up the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete-value nodes, concurrent sampling of conditionally-independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedup, while a series of high-level source-code and runtime modifications provide further performance boost by reducing parallelization overhead and increasing data locality for Non-Uniform Memory Access architectures. For big-data Bayesian GLM graphs, the end-result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and reaches within the striking distance of the memory-bandwidth-induced hardware limit. Using multi-threading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives which are often less cache-friendly, a likely scenario in modern predictive analytics workflow such as Hierarchical Bayesian GLM, variable selection, and ensemble regression and classification. The proposed optimization strategies improve the scaling of performance with number of cores and width of vector units (applicable to many-core SIMD processors such as Intel Xeon Phi and Graphic Processing Units), resulting in cost-effectiveness, energy efficiency (’green computing’), and higher speed on multi-core x86 processors.


62-08 Computational methods for problems pertaining to statistics
62F15 Bayesian inference
62R07 Statistical aspects of big data and data science
Full Text: DOI arXiv


[1] Ackley, D. H.; Hinton, G. E.; Sejnowski, T. J., A learning algorithm for Boltzmann machines, Cogn. Sci., 9, 147-169, (1985)
[2] Ahmed, A.; Aly, M.; Gonzalez, J.; Narayanamurthy, S.; Smola, A. J., Scalable inference in latent variable models, (Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, (2012), ACM), 123-132
[3] Antonio, K.; Beirlant, J., Actuarial statistics with generalized linear mixed models, Insurance Math. Econom., 40, 58-76, (2007) · Zbl 1104.62111
[4] Babapulle, M. N.; Joseph, L.; Bélisle, P.; Brophy, J. M.; Eisenberg, M. J., A hierarchical Bayesian meta-analysis of randomised clinical trials of drug-eluting stents, Lancet, 364, 583-591, (2004)
[5] Bache, K., Lichman, M., 2013. UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.
[6] Bernardo, J. M., The concept of exchangeability and its applications, Far East J. Math. Sci., 4, 111-122, (1996) · Zbl 0934.62011
[7] Bishop, C.M., 2006. Pattern recognition and machine learning. · Zbl 1107.68072
[8] Blei, D. M., Probabilistic topic models, Commun. ACM, 55, 77-84, (2012)
[9] Bovet, D. P.; Cesati, M., Understanding the Linux kernel, (2005), O’Reilly Media, Inc.
[10] Brockwell, A., Parallel Markov chain Monte Carlo simulation by pre-fetching, J. Comput. Graph. Statist., 15, 246-261, (2006)
[11] Broquedis, F.; Furmento, N.; Goglin, B.; Namyst, R.; Wacrenier, P. A., Dynamic task and data placement over NUMA architectures: an openmp runtime perspective, (Evolving OpenMP in an Age of Extreme Parallelism, (2009), Springer), 79-92
[12] Brush, S. G., History of the Lenz-Ising model, Rev. Modern Phys., 39, 883, (1967)
[13] Bryant, R.; O’Hallaron, D. R., Computer systems: A programmer’s perspective, (2011), Prentice Hall
[14] Carlin, B. P.; Gelfand, A. E.; Smith, A. F., Hierarchical Bayesian analysis of changepoint problems, Appl. Stat., 389-405, (1992) · Zbl 0825.62408
[15] Chandra, R., Parallel programming in openmp, (2001), Morgan Kaufmann
[16] Clint, M.; Weston, J.; Flannagan, J., Efficient Gram-Schmidt orthogonalisation on an array processor, (Parallel Processing: CONPAR 94—VAPP VI, (1994), Springer), 218-228
[17] da Silva, A. R.F., Cudabayesreg: parallel implementation of a Bayesian multilevel model for fMRI data analysis, J. Stat. Softw., 44, 1-24, (2011)
[18] Descombes, X.; Kruggel, F.; von Cramon, D. Y., Spatio-temporal fMRI analysis using Markov random fields, IEEE Trans. Med. Imaging, 17, 1028-1039, (1998)
[19] Diaconescu, R. E.; Zima, H. P., An approach to data distributions in chapel, Int. J. High Perform. Comput. Appl., 21, 313-335, (2007)
[20] Duane, S.; Kennedy, A. D.; Pendleton, B. J.; Roweth, D., Hybrid Monte Carlo, Phys. Lett. B, 195, 216-222, (1987)
[21] Dumont, M. D., Markov chain Monte Carlo on the GPU, (2011), Rochester Institute of Technology, (Ph.D. thesis)
[22] Fenton, N., Neil, M., 2004. Combining evidence in risk analysis using Bayesian networks. Agena White Paper W 704.
[23] Foster, I., Designing and building parallel programs, (1995), Addison Wesley Publishing Company
[24] Gelman, A.; Hill, J., Data analysis using regression and multilevel/hierarchical models, (2007), Cambridge University Press
[25] Geman, S.; Geman, D., Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 721-741, (1984) · Zbl 0573.62030
[26] Gilks, W. R.; Wild, P., Adaptive rejection sampling for Gibbs sampling, Appl. Stat., 337-348, (1992) · Zbl 0825.62407
[27] Girolami, M.; Calderhead, B., Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B Stat. Methodol., 73, 123-214, (2011)
[28] Gonzalez, J., Low, Y., Gretton, A., Guestrin, C., 2011. Parallel Gibbs sampling: from colored fields to thin junction trees, in: International Conference on Artificial Intelligence and Statistics, pp. 324-332.
[29] Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C., 2012. PowerGraph: distributed graph-parallel computation on natural graphs, in: Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI, pp. 17-30.
[30] Gonzalez, J. E.; Low, Y.; Guestrin, C.; O’Hallaron, D., Distributed parallel inference on large factor graphs, (Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, (2009), AUAI Press), 203-212
[31] Hastie, T.; Tibshirani, R., Non-parametric logistic and proportional odds regression, Appl. Stat., 36, 260-276, (1987)
[32] Hastings, W. K., Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57, 97-109, (1970) · Zbl 0219.65008
[33] Hinton, G. E.; Osindero, S.; Teh, Y. W., A fast learning algorithm for deep belief nets, Neural Comput., 18, 1527-1554, (2006) · Zbl 1106.68094
[34] Hoffman, M. D.; Gelman, A., The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo, J. Mach. Learn. Res., 15, 1593-1623, (2014) · Zbl 1319.60150
[35] Hopfield, J. J., Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci., 79, 2554-2558, (1982) · Zbl 1369.92007
[36] Jeffers, J.; Reinders, J., Intel xeon phi coprocessor high performance programming, (2013), Newnes
[37] Kirk, D. B.; Wen-mei, W. H., Programming massively parallel processors: A hands-on approach, (2012), Newnes
[38] Kontoghiorghes, E. J., Ordinary linear model estimation on a massively parallel SIMD computer, Concurrency, Pract. Exp., 11, 323-341, (1999) · Zbl 0941.68803
[39] Kontoghiorghes, E., (Parallel Algorithms for Linear Models: Numerical Methods and Estimation Problems, Advances in Computational Economics, vol. 15, (2000), Springer Science & Business Media) · Zbl 0981.68176
[40] Kontoghiorghes, E. J., Handbook of parallel computing and statistics, (2005), CRC Press
[41] Kontoghiorghes, E. J.; Clarke, M., Solving the updated and downdated ordinary linear model on massively parallel SIMD systems, Parallel Algorithms Appl., 1, 243-252, (1993) · Zbl 1049.68551
[42] Kontoghiorghes, E. J.; Clint, M.; Naegeli, H. H., Recursive least-squares using a hybrid Householder algorithm on massively parallel SIMD systems, Parallel Comput., 25, 1147-1159, (1999) · Zbl 1050.65502
[43] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: NIPS, p. 4.
[44] Kumar, S.; Kim, D.; Smelyanskiy, M.; Chen, Y. K.; Chhugani, J.; Hughes, C. J.; Kim, C.; Lee, V. W.; Nguyen, A. D., Atomic vector operations on chip multiprocessors, (ACM SIGARCH Computer Architecture News, (2008), IEEE Computer Society), 441-452
[45] Le, Q. V.; Zou, W. Y.; Yeung, S. Y.; Ng, A. Y., Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, (2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2011), IEEE), 3361-3368
[46] Lee, T. S.; Mumford, D., Hierarchical Bayesian inference in the visual cortex, J. Opt. Soc. Amer. A, 20, 1434-1448, (2003)
[47] Low, Y.; Bickson, D.; Gonzalez, J.; Guestrin, C.; Kyrola, A.; Hellerstein, J. M., Distributed graphlab: a framework for machine learning and data mining in the cloud, Proc. VLDB Endow., 5, 716-727, (2012)
[48] Mahani, A.S., Hasan, A., Jiang, M., Sharabiani, M.T., 2015. sns: Stochastic Newton Sampler (SNS). URL: http://CRAN.R-project.org/package=sns. R package version 1.0.0.
[49] Majo, Z.; Gross, T. R., Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead, (ACM SIGPLAN Notices, (2011), ACM), 11-20
[50] Marsaglia, G.; Tsang, W. W., A simple method for generating gamma variables, ACM Trans. Math. Software (TOMS), 26, 363-372, (2000) · Zbl 1365.65022
[51] McCool, M.; Reinders, J.; Robison, A., Structured parallel programming: patterns for efficient computation, (2012), Elsevier
[52] McCulloch, C. E., Generalized linear mixed models, (2006), Wiley Online Library
[53] Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E., Equation of state calculations by fast computing machines, J. Chem. Phys., 21, 1087-1092, (2004)
[54] Metropolis, N.; Ulam, S., The Monte Carlo method, J. Amer. Statist. Assoc., 44, 335-341, (1949) · Zbl 0033.28807
[55] Mohamed, A. R.; Sainath, T. N.; Dahl, G.; Ramabhadran, B.; Hinton, G. E.; Picheny, M. A., Deep belief networks using discriminative features for phone recognition, (2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2011), IEEE), 5060-5063
[56] Molina da Cruz, E. H.; Alves, Z.; Carissimi, A.; Navaux, P. O.A.; Ribeiro, C. P.; Mehaut, J., Using memory access traces to map threads and data on hierarchical multi-core platforms, (2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), (2011), IEEE), 551-558
[57] Neal, R. M., Bayesian learning for neural networks, (1995), University of Toronto, (Ph.D. thesis)
[58] Neal, R. M., Slice sampling, Ann. Statist., 705-741, (2003) · Zbl 1051.65007
[59] Nelder, J. A.; Baker, R., Generalized linear models, (1972), Wiley Online Library
[60] Newman, D.; Asuncion, A.; Smyth, P.; Welling, M., Distributed algorithms for topic models, J. Mach. Learn. Res., 10, 1801-1828, (2009) · Zbl 1235.68324
[61] Nocedal, J.; Wright, S., Numerical optimization, Springer Series in Operations Research and Financial Engineering, (2006), Springer-Verlag · Zbl 1104.65059
[62] Parkinson, D., Organisational aspects of using parallel computers, Parallel Comput., 5, 75-83, (1987)
[63] Perez, P., Markov random fields and images, CWI Q., 11, 413-437, (1998) · Zbl 0924.60080
[64] Plummer, M., 2013. Jags: Just another Gibbs sampler, version 3.4.0. URL: http://mcmc-jags.sourceforge.net/.
[65] Polikar, R., Ensemble based systems in decision making, IEEE Circuits Syst. Mag., 6, 21-45, (2006)
[66] Press, W. H., Numerical recipes 3rd edition: the art of scientific computing, (2007), Cambridge University Press
[67] Ren, R.; Orkoulas, G., Parallel Markov chain Monte Carlo simulations, J. Chem. Phys., 126, 211102, (2007)
[68] Ribeiro, C. P.; Méhaut, J.; Carissimi, A., Memory affinity management for numerical scientific applications over multi-core multiprocessors with hierarchical memory, (2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), (2010), IEEE), 1-4
[69] Ripley, B. D., Stochastic simulation. vol. 316, (2009), John Wiley & Sons
[70] Robert, C.P., Casella, G., 2004. Monte Carlo Statistical Methods. Volume 319. Citeseer. · Zbl 1096.62003
[71] Rossi, P. E.; Allenby, G. M.; McCulloch, R. E., Bayesian statistics and marketing, (2005), J. Wiley & Sons · Zbl 1094.62037
[72] Sharabiani, M. T.A.; Vermeulen, R.; Scoccianti, C.; Hosnijeh, F. S.; Minelli, L.; Sacerdote, C.; Palli, D.; Krogh, V.; Tumino, R.; Chiodini, P., Immunologic profile of excessive body weight, Biomarkers, 16, 243-251, (2011)
[73] Smolensky, P., 1986. Information processing in dynamical systems: foundations of harmony theory.
[74] Sobehart, J.R., Stein, R., Mikityanskaya, V., Li, L., 2000. Moody’s public firm risk model: A hybrid approach to modeling short term default risk. Moody’s Investors Service, Global Credit Research, Rating Methodology, March.
[75] Stan Development Team, 2014. Stan: A C++library for probability and sampling, version 2.5.0. URL: http://mc-stan.org/.
[76] Strid, I., Efficient parallelisation of metropolis-Hastings algorithms using a prefetching approach, Comput. Statist. Data Anal., 54, 2814-2835, (2010) · Zbl 1284.62036
[77] Sutter, J. M.; Kalivas, J. H., Comparison of forward selection, backward elimination, and generalized simulated annealing for variable selection, Microchem. J., 47, 60-66, (1993)
[78] Thomas, A.; O’Hara, B.; Ligges, U.; Sturtz, S., Making bugs open, R News, 6, 12-17, (2006)
[79] Tibbits, M. M.; Haran, M.; Liechty, J. C., Parallel multivariate slice sampling, Stat. Comput., 21, 415-430, (2011) · Zbl 1253.60082
[80] Wilkinson, D. J., Parallel Bayesian computation, (Kontoghiorghes, E. J., Handbook of Parallel Computing and Statistics, (2010), CRC Press)
[81] Xu, T., Ihler, A.T., 2011. Multicore Gibbs sampling in dense, unstructured graphs, in: International Conference on Artificial Intelligence and Statistics, pp. 798-806.
[82] Yu, L.; Xu, Y., A parallel Gibbs sampling algorithm for motif finding on GPU, (2009 IEEE International Symposium on Parallel and Distributed Processing with Applications, (2009), IEEE), 555-558
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.