Algorithmic patterns for \(\mathcal {H}\)-matrices on many-core processors. (English) Zbl 1434.65062

Summary: In this work, we consider the reformulation of hierarchical \((\mathcal {H})\) matrix algorithms for many-core processors, with a model implementation on graphics processing units (GPUs). \(\mathcal {H}\)-matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix-vector products. Parallelizing \(\mathcal {H}\)-matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing \(\mathcal {H}\)-matrix CPU implementations with many-core processors, we here aim at relying entirely on that processor type. As our main contribution, we introduce the parallel algorithmic patterns necessary to map the full \(\mathcal {H}\)-matrix construction and the fast matrix-vector product to many-core hardware. Crucial ingredients are space-filling curves, parallel tree traversal and batching of linear algebra operations. The resulting model GPU implementation hmglib is, to the best of the authors' knowledge, the first entirely GPU-based open-source \(\mathcal {H}\)-matrix library of this kind. We investigate application examples as present in kernel ridge regression, Gaussian process regression and kernel-based interpolation. In this context, an in-depth performance analysis and a comparative performance study against a standard multi-core CPU \(\mathcal {H}\)-matrix library highlight profound speedups of our many-core parallel approach.
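The space-filling-curve ingredient mentioned in the summary is commonly realized via Morton (Z-order) keys: sorting points by their Morton key groups spatially nearby points into contiguous index ranges, which lets a spatial tree be built with data-parallel primitives. The following is a minimal, hedged sketch of the standard 2D bit-interleaving construction (not code from hmglib itself; all names are illustrative):

```python
def interleave_bits(v: int) -> int:
    """Spread the lower 16 bits of v so that a zero bit follows each original bit
    (classic magic-number method)."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton2d(x: int, y: int) -> int:
    """Z-order (Morton) key for integer grid coordinates (x, y), x and y < 2^16."""
    return interleave_bits(x) | (interleave_bits(y) << 1)

# Sorting by Morton key orders points along the Z-curve; nearby points end up
# in nearby positions, which is the basis for parallel spatial tree construction.
points = [(3, 1), (0, 0), (2, 2), (1, 3)]
ordered = sorted(points, key=lambda p: morton2d(*p))  # -> [(0, 0), (3, 1), (1, 3), (2, 2)]
```

On a GPU, the same key computation would run per point in a kernel, followed by a parallel radix sort of (key, point-index) pairs, e.g. with a Thrust-style `sort_by_key`.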


65F99 Numerical linear algebra
65Y05 Parallel numerical computation
68T05 Learning and adaptive systems in artificial intelligence
65Y10 Numerical algorithms for specific classes of architectures
Full Text: DOI arXiv


[1] Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. In: Proceedings of the International Conference on Supercomputing, ICS ’17, pp. 5:1-5:10. ACM, New York (2017)
[2] Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., Takahashi, T.: Task-based FMM for multicore architectures. SIAM J. Sci. Comput. 36(1), C66-C93 (2014) · Zbl 1323.78017
[3] Bebendorf, M.: AHMED: Another software library on hierarchical matrices for elliptic differential equations. https://github.com/xantares/ahmed. Accessed 31 Aug 2018
[4] Bebendorf, M.: Hierarchical Matrices—A Means to Efficiently Solve Elliptic Boundary Value Problems, Lecture Notes in Computational Science and Engineering, vol. 63. Springer, Berlin (2008) · Zbl 1151.65090
[5] Bebendorf, M., Kunis, S.: Recompression techniques for adaptive cross approximation. J. Integr. Equ. Appl. 21(3), 331-357 (2009) · Zbl 1180.65162
[6] Bebendorf, M., Rjasanow, S.: Adaptive low-rank approximation of collocation matrices. Computing 70(1), 1-24 (2003) · Zbl 1068.41052
[7] Bell, N., Hoberock, J.: Thrust: a productivity-oriented library for CUDA. GPU Comput. Gems Jade Ed. 2, 359-371 (2011)
[8] Bern, M., Eppstein, D., Teng, S.H.: Parallel construction of quadtrees and quality triangulations. Int. J. Comput. Geom. Appl. 09(06), 517-532 (1999) · Zbl 1074.68630
[9] Börm, S.: \(\mathcal{H}^2\)-matrices—multilevel methods for the approximation of integral operators. Comput. Vis. Sci. 7(3), 173-181 (2004) · Zbl 1070.65124
[10] Börm, S.: H2Lib, A Library for Hierarchical Matrices (2017). http://www.h2lib.org. Accessed 31 Aug 2018
[11] Börm, S., Christophersen, S.: Approximation of BEM Matrices Using GPGPUs. ArXiv e-prints (2015)
[12] Börm, S., Grasedyck, L., Hackbusch, W.: Introduction to hierarchical matrices with applications. Eng. Anal. Bound. Elem. 27(5), 405-422 (2003) · Zbl 1035.65042
[13] Boukaram, W., Ltaief, H., Litvinenko, A., Abdelfattah, A., Keyes, D.E.: Accelerating matrix-vector multiplication on hierarchical matrices using graphical processing units
[14] Boukaram, W.H., Turkiyyah, G., Ltaief, H., Keyes, D.E.: Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression. ArXiv e-prints (2017)
[15] Charara, A., Keyes, D., Ltaief, H.: Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs. ACM Trans. Math. Softw. 9(4), Art. No 39 (2017) · Zbl 1471.65028
[16] Fasshauer, G.F.: Meshfree Approximation Methods with MATLAB. World Scientific Publishing Co., Inc, River Edge (2007) · Zbl 1123.65001
[17] Garanzha, K., Pantaleoni, J., McAllister, D.: Simpler and faster HLBVH with work queues. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG ’11, pp. 59-64. ACM, New York (2011)
[18] Ghysels, P., Li, X.S., Rouet, F., Williams, S., Napov, A.: An efficient multicore implementation of a novel HSS-structured multifrontal solver using randomized sampling. SIAM J. Sci. Comput. 38(5), S358 (2016) · Zbl 1352.65092
[19] Grasedyck, L., Kriemann, R., Le Borne, S.: Parallel black box-LU preconditioning for elliptic boundary value problems. Comput. Vis. Sci. 11(4), 273-291 (2008)
[20] Greengard, L., Rokhlin, V.: A new version of the fast multipole method for the Laplace equation in three dimensions. Acta Numer. 6, 229-269 (1997) · Zbl 0889.65115
[21] Hackbusch, W.: Hierarchical Matrices : Algorithms and Analysis, Springer Series in Computational Mathematics, vol. 49. Springer, Berlin (2015) · Zbl 1336.65041
[22] Hackbusch, W.: Survey on the technique of hierarchical matrices. Vietnam J. Math. 44(1), 71-101 (2016) · Zbl 1336.65028
[23] Hackbusch, W., Börm, S.: \(\mathcal{H}^2\)-matrix approximation of integral operators by interpolation. Appl. Numer. Math. 43(1-2), 129-143 (2002) · Zbl 1019.65103
[24] Hackbusch, W., Khoromskij, B., Sauter, S.A.: On \(\mathcal{H}^2\)-matrices. In: Lectures on Applied Mathematics: Proceedings of the Symposium Organized by the Sonderforschungsbereich 438 on the Occasion of Karl-Heinz Hoffmann's 60th Birthday, Munich, 30 June-1 July 1999, p. 9. Springer (2000)
[25] Hackbusch, W., Nowak, Z.P.: On the fast matrix multiplication in the boundary element method by panel clustering. Numer. Math. 54(4), 463-491 (1989) · Zbl 0641.65038
[26] Kriemann, R.: Parallel \(\mathcal{H}\)-matrix arithmetics on shared memory systems. Computing 74(3), 273-297 (2005) · Zbl 1071.65050
[27] Kriemann, R.: \(\mathcal{H}\)-LU factorization on many-core systems. Comput. Vis. Sci. 16(3), 105-117 (2013) · Zbl 1388.65210
[28] Kriemann, R.: \(\mathcal{H}\mathrm{Lib}^{pro}\) (website) (2017). http://www.hlibpro.com. Accessed 31 Aug 2018
[29] Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., Manocha, D.: Fast BVH construction on GPUs. Comput. Graph. Forum 28(2), 375-384 (2009)
[30] March, W.B., Xiao, B., Yu, C., Biros, G.: ASKIT: an efficient, parallel library for high-dimensional kernel summations. SIAM J. Sci. Comput. 38, S720-S749 (2016) · Zbl 1416.65578
[31] Merrill, D., Garland, M., Grimshaw, A.: Scalable GPU graph traversal. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pp. 117-128. ACM, New York (2012)
[32] Morton, G.: A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Canada (1966)
[33] Poulson, J.: DMHM—Distributed-Memory Hierarchical Matrices. https://bitbucket.org/poulson/dmhm. Accessed 31 Aug 2018
[34] Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press, Cambridge (2005)
[35] Rouet, F.H., Li, X.S., Ghysels, P., Napov, A.: A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. ACM Trans. Math. Softw. 42(4), Art. No 27 (2016) · Zbl 1369.65043
[36] Sheng, Z., Dewilde, P., Chandrasekaran, S.: Algorithms to Solve Hierarchically Semi-separable Systems, pp. 255-294. Birkhäuser Basel, Basel (2007) · Zbl 1123.65020
[37] Szuppe, J.: Boost.Compute: a parallel computing library for C++ based on OpenCL. In: Proceedings of the 4th International Workshop on OpenCL, IWOCL ’16, pp. 15:1-15:39. ACM, New York (2016)
[38] Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103-111 (1990)
[39] Vovk, V.: Kernel Ridge Regression, pp. 105-116. Springer, Berlin (2013) · Zbl 1325.62094
[40] Wendland, H.: Scattered Data Approximation. Cambridge University Press, Cambridge (2004) · Zbl 1185.65022
[41] Yalamanchili, P., Arshad, U., Mohammed, Z., Garigipati, P., Entschev, P., Kloppenborg, B., Malcolm, J., Melonakos, J.: ArrayFire—a high performance software library for parallel computing with an easy-to-use API (2015). https://github.com/arrayfire/arrayfire. Accessed 31 Aug 2018
[42] Yokota, R., Barba, L.: FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method. Comput. Fluids 80, 17-27 (2013) · Zbl 1284.76184
[43] Yokota, R., Barba, L., Knepley, M.G.: PetRBF: a parallel O(N) algorithm for radial basis function interpolation with Gaussians. Comput. Methods Appl. Mech. Eng. 199(25), 1793-1804 (2010) · Zbl 1231.65026
[44] Zaspel, P.: MPLA—Massively Parallel Linear Algebra (2017). https://github.com/zaspel/MPLA. Accessed 31 Aug 2018
[45] Zaspel, P.: hmglib—Hierarchical Matrices on GPU(s) Library (2017). https://github.com/zaspel/hmglib. Accessed 31 Aug 2018
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.