Randomized GPU algorithms for the construction of hierarchical matrices from matrix-vector operations. (English) Zbl 07099326

Summary: Randomized algorithms for generating low-rank approximations of large dense matrices have become popular methods in scientific computing and machine learning. In this paper, we extend the scope of these methods and present batched GPU randomized algorithms for the efficient generation of low-rank representations of large sets of small dense matrices, as well as their generalization to the construction of hierarchically low-rank symmetric \(\mathcal{H}^2\) matrices with general partitioning structures. In both cases, the algorithms access the matrices only through matrix-vector multiplication operations, which can be performed in blocks to increase the arithmetic intensity and substantially boost the resulting performance. The batched GPU kernels are adaptive, allow nonuniform sizes in the matrices of the batch, and are more effective than SVD factorizations on matrices with fast-decaying spectra. The hierarchical matrix generation consists of two phases, interleaved at every level of the matrix hierarchy. The first phase adaptively generates low-rank approximations of matrix blocks through randomized matrix-vector sampling. The second phase accumulates and compresses these blocks into an incrementally constructed hierarchical matrix. The accumulation expresses the low-rank blocks of a given level as a set of local low-rank updates performed simultaneously on the whole matrix, allowing high-performance batched kernels to be used in the compression operations. When the ranks of the blocks generated in the first phase are too large to be processed in a single operation, the low-rank updates can be split into smaller updates and applied in sequence. Assuming a representative rank \(k\), the resulting matrix has optimal \(O(kN)\) asymptotic storage complexity because of the nested bases it uses. The ability to generate an \(\mathcal{H}^2\) matrix from matrix-vector products allows us to support a general randomized matrix-matrix multiplication operation, an important kernel in hierarchical matrix computations. Numerical experiments demonstrate the high performance of the algorithms and their effectiveness in generating hierarchical matrices to a desired target accuracy.
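
The sampling step of the first phase follows the spirit of the randomized range finder of Halko, Martinsson, and Tropp [SIAM Rev., 53 (2011), pp. 217-288]. As a minimal illustration of the matvec-only access pattern the summary describes, the NumPy sketch below builds a rank-\(k\) approximation \(A \approx Q B Q^T\) of a symmetric operator from two blocked matrix-vector products. This is an expository sketch, not the authors' adaptive batched CUDA implementation; the names (`randomized_symmetric_lowrank`, `matvec`, the oversampling margin `p`) and default values are hypothetical.

```python
import numpy as np

def randomized_symmetric_lowrank(matvec, n, k, p=10, rng=None):
    """Rank-k approximation A ~= Q @ B @ Q.T of a symmetric n-by-n operator,
    accessed only through blocked products X -> A @ X."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((n, k + p))  # Gaussian sampling block
    Y = matvec(Omega)                        # blocked mat-vec: Y = A @ Omega
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for range(Y)
    B = Q.T @ matvec(Q)                      # small (k+p)-by-(k+p) core
    return Q, B

def decaying_symmetric(m, rng):
    """Test matrix with a fast-decaying spectrum, where low-rank
    approximation is effective (cf. the summary's claim)."""
    V, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return (V * 2.0 ** -np.arange(m)) @ V.T  # V diag(2^-j) V^T

rng = np.random.default_rng(0)
# A small set of nonuniformly sized matrices; the batched GPU kernels
# process such a set in parallel, whereas here we simply loop over it.
for m in (60, 80, 100):
    A = decaying_symmetric(m, rng)
    Q, B = randomized_symmetric_lowrank(lambda X: A @ X, m, k=20)
    err = np.linalg.norm(A - Q @ B @ Q.T) / np.linalg.norm(A)
    print(f"n={m}: relative error {err:.1e}")
```

In the paper's setting, the QR factorizations and small dense operations for many such blocks execute concurrently as adaptive batched GPU kernels, with the rank determined on the fly to meet the target accuracy rather than fixed in advance as in this sketch.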

MSC:

65Fxx Numerical linear algebra
68N19 Other programming paradigms (object-oriented, sequential, concurrent, automatic, etc.)
68W10 Parallel algorithms in computer science
68W20 Randomized algorithms
68W25 Approximation algorithms