×

An efficient multicore implementation of a novel HSS-structured multifrontal solver using randomized sampling. (English) Zbl 1352.65092

Summary: We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to sevenfold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK (STRUctured Matrices PACKage), which also has a distributed memory component for dense rank-structured matrices.

MSC:

65F05 Direct numerical methods for linear systems and matrix inversion
65F50 Computational methods for sparse matrices
65Y05 Parallel numerical computation
65Y20 Complexity and performance of numerical algorithms
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] E. Agullo, A. Buttari, A. Guermouche, and F. Lopez, {\it Multifrontal QR factorization for multicore architectures over runtime systems}, in Euro-Par 2013 Parallel Processing, Lecture Notes in Comput. Sci. 8097, Springer-Verlag, Berlin, Heidelberg, 2013, pp. 521-532. · Zbl 1369.65062
[2] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, {\it Numerical linear algebra on emerging architectures: The plasma and magma projects}, J. Phys. Conf. Ser., 180 (2009), 012037.
[3] S. Ambikasaran, {\it Fast Algorithms for Dense Numerical Linear Algebra and Applications}, Ph.D. thesis, Stanford, Palo Alto, CA, 2013.
[4] P. R. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L’Excellent, and C. Weisbecker, {\it Improving multifrontal methods by means of block low-rank representations}, SIAM J. Sci. Comput., 37 (2015), pp. A1451-A1474, . · Zbl 1314.05111
[5] P. R. Amestoy, T. A. Davis, and I. S. Duff, {\it An approximate minimum degree ordering algorithm}, SIAM J. Matrix Anal. Appl., 17 (1996), pp. 886-905, . · Zbl 0861.65021
[6] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, and J. Koster, {\it A fully asynchronous multifrontal solver using distributed dynamic scheduling}, SIAM J. Matrix Anal. Appl., 23 (2001), pp. 15-41, . · Zbl 0992.65018
[7] P. R. Amestoy, I. S. Duff, D. Ruiz, and B. Uçar, {\it A parallel matrix scaling algorithm}, in High Performance Computing for Computational Science-VECPAR 2008, Lecture Notes in Comput. Sci. 5336, Springer, Berlin, Heidelberg, 2008, pp. 301-313. · Zbl 1204.65040
[8] A. Aminfar, S. Ambikasaran, and E. Darve, {\it A fast block low-rank dense solver with applications to finite-element matrices}, J. Comput. Phys., 304 (2016), pp. 170-188. · Zbl 1349.65595
[9] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, {\it StarPU: A unified platform for task scheduling on heterogeneous multicore architectures}, Concurr. Comput., 23 (2011), pp. 187-198.
[10] M. Bebendorf, {\it Approximation of boundary element matrices}, Numer. Math., 86 (2000), pp. 565-589. · Zbl 0966.65094
[11] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, {\it Cilk: An efficient multithreaded runtime system}, in PPOPP ’95, Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, 1996, pp. 207-216.
[12] R. D. Blumofe and C. E. Leiserson, {\it Scheduling multithreaded computations by work stealing}, J. ACM, 46 (1999), pp. 720-748. · Zbl 1065.68504
[13] S. Börm, L. Grasedyck, and W. Hackbusch, {\it Introduction to hierarchical matrices with applications}, Eng. Anal. Bound. Elem., 27 (2003), pp. 405-422. · Zbl 1035.65042
[14] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra, {\it Dague: A generic distributed dag engine for high performance computing}, Parallel Comput., 38 (2012), pp. 37-51.
[15] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, {\it A class of parallel tiled linear algebra algorithms for multicore architectures}, Parallel Comput., 35 (2009), pp. 38-53.
[16] T. F. Chan, {\it Rank revealing QR factorizations}, Linear Algebra Appl., 88 (1987), pp. 67-82. · Zbl 0624.65025
[17] S. Chandrasekaran, P. Dewilde, M. Gu, and N. Somasunderam, {\it On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs}, SIAM J. Matrix Anal. Appl., 31 (2010), pp. 2261-2290, . · Zbl 1209.65032
[18] S. Chandrasekaran, M. Gu, and T. Pals, {\it A fast ULV decomposition solver for hierarchically semiseparable representations}, SIAM J. Matrix Anal. Appl., 28 (2006), pp. 603-622, . · Zbl 1120.65031
[19] H. Cheng, Z. Gimbutas, P.-G. Martinsson, and V. Rokhlin, {\it On the compression of low rank matrices}, SIAM J. Sci. Comput., 26 (2005), pp. 1389-1404, . · Zbl 1083.65042
[20] A. R. Curtis and J. K. Reid, {\it On the automatic scaling of matrices for Gaussian elimination}, J. Inst. Math. Appl., 10 (1972), pp. 118-124. · Zbl 0249.65026
[21] J. Dongarra, M. Faverge, H. Ltaief, and P. Luszczek, {\it Achieving numerical accuracy and high performance using recursive tile lu factorization with partial pivoting}, Concurr. Comput., 26 (2014), pp. 1408-1431.
[22] I. S. Duff and J. Koster, {\it The design and use of algorithms for permuting large entries to the diagonal of sparse matrices}, SIAM J. Matrix Anal. Appl., 20 (1999), pp. 889-901, . · Zbl 0947.65048
[23] I. S. Duff and J. K. Reid, {\it The multifrontal solution of indefinite sparse symmetric linear}, ACM Trans. Math. Software, 9 (1983), pp. 302-325. · Zbl 0515.65022
[24] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, {\it Ompss: A proposal for programming heterogeneous multi-core architectures}, Parallel Process. Lett., 21 (2011), pp. 173-193.
[25] B. Engquist and L. Ying, {\it Sweeping preconditioner for the Helmholtz equation: Hierarchical matrix representation}, Comm. Pure Appl. Math., 64 (2011), pp. 697-735. · Zbl 1229.35037
[26] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, {\it Cache-oblivious algorithms}, in 40th Annual Symposium on Foundations of Computer Science, IEEE, Washington, DC, 1999, pp. 285-297. · Zbl 1295.68236
[27] S. Ghemawat and P. Menage, {\it Tcmalloc: Thread-Caching Malloc}, (2009).
[28] M. Gu and S. C. Eisenstat, {\it Efficient algorithms for computing a strong rank-revealing QR factorization}, SIAM J. Sci. Comput., 17 (1996), pp. 848-869, . · Zbl 0858.65044
[29] N. Halko, P.-G. Martinsson, and J. A. Tropp, {\it Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions}, SIAM Rev., 53 (2011), pp. 217-288, . · Zbl 1269.65043
[30] P. Hénon, P. Ramet, and J. Roman, {\it PaStiX: A high-performance parallel direct solver for sparse symmetric positive definite systems}, Parallel Comput., 28 (2002), pp. 301-321. · Zbl 0984.68208
[31] J. D. Hogg, J. K. Reid, and J. A. Scott, {\it Design of a multicore sparse Cholesky factorization using DAGs}, SIAM J. Sci. Comput., 32 (2010), pp. 3627-3649, . · Zbl 1221.65088
[32] G. Karypis and V. Kumar, {\it A fast and high quality multilevel scheme for partitioning irregular graphs}, SIAM J. Sci. Comput., 20 (1998), pp. 359-392, . · Zbl 0915.68129
[33] K. Kim and V. Eijkhout, {\it A parallel sparse direct solver via hierarchical DAG scheduling}, ACM Trans. Math. Software, 41 (2014), 3. · Zbl 1369.65046
[34] R. Kriemann, {\it \(\mathcal{H} \)-LU factorization on many-core systems}, Comput. Vis. Sci., 16 (2013), pp. 105-117. · Zbl 1388.65210
[35] X. Lacoste, M. Faverge, G. Bosilca, P. Ramet, and S. Thibault, {\it Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes}, in 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW), IEEE, Washington, DC, 2014, pp. 29-38.
[36] J.-Y. L’Excellent, {\it Multifrontal Methods: Parallelism, Memory Usage and Numerical Aspects}, Habilitation à Diriger des Recherches, École Normale Supérieure de Lyon, Lyon, France, 2012.
[37] L. Lin, J. Lu, and L. Ying, {\it Fast construction of hierarchical matrix representation from matrix-vector multiplication}, J. Comput. Phys., 230 (2011), pp. 4071-4087. · Zbl 1218.65038
[38] J. W. H. Liu, {\it The multifrontal method for sparse matrix solution: Theory and practice}, SIAM Rev., 34 (1992), pp. 82-109, . · Zbl 0919.65019
[39] P.-G. Martinsson, {\it A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix}, SIAM J. Matrix Anal. Appl., 32 (2011), pp. 1251-1274, . · Zbl 1237.65028
[40] J. D. McCalpin, {\it STREAM: Sustainable Memory Bandwidth in High Performance Computers}, .
[41] M. McCool, J. Reinders, and A. Robison, {\it Structured Parallel Programming: Patterns for Efficient Computation}, Morgan Kaufmann, San Francisco, CA, 2012.
[42] A. Napov and X. S. Li, {\it An algebraic multifrontal preconditioner that exploits the low-rank property}, Numer. Linear Algebra Appl., 23 (2016), pp. 61-82, . · Zbl 1413.65049
[43] Y. Notay, {\it An aggregation-based algebraic multigrid method}, Electron. Trans. Numer. Anal., 37 (2010), pp. 123-146. · Zbl 1206.65133
[44] F. Pellegrini and J. Roman, {\it Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs}, in High-Performance Computing and Networking, Springer-Verlag, London, 1996, pp. 493-498.
[45] G. Quintana-Ortí, X. Sun, and C. H. Bischof, {\it A BLAS-3 version of the QR factorization with column pivoting}, SIAM J. Sci. Comput., 19 (1998), pp. 1486-1494, . · Zbl 0912.65034
[46] J. Reinders, {\it Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism}, O’Reilly Media, Sebastopol, CA, 2007.
[47] F.-H. Rouet, X. S. Li, P. Ghysels, and A. Napov, {\it A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization}, ACM Trans. Math. Software, to appear. · Zbl 1369.65043
[48] D. Ruiz, {\it A Scaling Algorithm to Equilibrate Both Rows and Columns Norms in Matrices}, Technical report RT/APO/01/4, ENSEEIHT-IRIT, Toulouse, France, 2001; also appeared as RAL report RAL-TR-2001-034, 2001.
[49] O. Schenk, K. Gärtner, and W. Fichtner, {\it Efficient sparse lu factorization with left-right looking strategy on shared memory multiprocessors}, BIT, 40 (2000), pp. 158-176. · Zbl 0957.65016
[50] R. Vandebril, M. Van Barel, G. Golub, and N. Mastronardi, {\it A bibliography on semiseparable matrices}, Calcolo, 42 (2005), pp. 249-270. · Zbl 1168.15312
[51] S. Wang, M. V. de Hoop, and J. Xia, {\it On 3D modeling of seismic wave propagation via a structured parallel multifrontal direct Helmholtz solver}, Geophys. Prospect., 59 (2011), pp. 857-873.
[52] S. Wang, X. S. Li, F.-H. Rouet, J. Xia, and M. V. De Hoop, {\it A parallel geometric multifrontal solver using hierarchically semiseparable structure}, ACM Trans. Math. Software, 42 (2016), 21. · Zbl 1369.65050
[53] C. Weisbecker, {\it Improving Multifrontal Solvers by Means of Algebraic Block Low-Rank Representations}, Ph.D. thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2013.
[54] J. H. Wilkinson, {\it Rounding Errors in Algebraic Processes}, Dover Publications, New York, 1994. · Zbl 0868.65027
[55] J. Xia, {\it Randomized sparse direct solvers}, SIAM J. Matrix Anal. Appl., 34 (2013), pp. 197-227, . · Zbl 1269.65029
[56] J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li, {\it Fast algorithms for hierarchically semiseparable matrices}, Numer. Linear Algebra Appl., 17 (2010), pp. 953-976. · Zbl 1240.65087
[57] J. Xia, Y. Xi, and M. Gu, {\it A superfast structured solver for Toeplitz linear systems via randomized sampling}, SIAM J. Matrix Anal. Appl., 33 (2012), pp. 837-858, . · Zbl 1258.65030
[58] A. YarKhan, J. Kurzak, and J. Dongarra, {\it Quark Users Guide: Queueing and Runtime for Kernels}, University of Tennessee Innovative Computing Laboratory Technical Report ICL-UT-11-02, University of Tennessee, Knoxville, TN, 2011.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.