
Auto-tuned Krylov methods on cluster of graphics processing unit. (English) Zbl 1314.65049

Summary: Exascale computers are expected to have highly hierarchical architectures with nodes composed of multi-core processors (CPU, central processing unit) and accelerators (GPU, graphics processing unit). The different programming levels give rise to new and difficult algorithmic issues. In particular, when solving extremely large linear systems, new programming paradigms for Krylov methods should be defined and evaluated against the modern state of the art of scientific methods. Iterative Krylov methods involve linear algebra operations such as the dot product, the norm, the addition of vectors and the sparse matrix-vector multiplication. These operations are computationally expensive for large matrices. In this paper, we focus on how to perform these operations effectively, in double precision, on the GPU, in order to make iterative Krylov methods more robust and thereby reduce the computing time. The performance of our algorithms is evaluated on several matrices arising from engineering problems. Numerical experiments illustrate the robustness and accuracy of our implementation compared to existing libraries. We deal with different preconditioned Krylov methods: the conjugate gradient for symmetric positive-definite matrices, and the generalized conjugate residual, bi-conjugate gradient conjugate residual, transpose-free quasi-minimal residual, stabilized bi-conjugate gradient and stabilized bi-conjugate gradient \((L)\) methods for the solution of sparse linear systems with nonsymmetric matrices. We consider and compare several sparse compressed formats, and propose a way to implement Krylov methods effectively on the GPU and on multi-core CPUs. Finally, we give strategies to speed up the algorithms by auto-tuning the threading design according to the problem characteristics and the hardware. As a conclusion, we propose and analyse hybrid sub-structuring methods that should pave the way to exascale hybrid methods.
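
To illustrate the kind of GPU kernel the summary refers to, the sketch below shows a scalar sparse matrix-vector product in CUDA for the CSR (compressed sparse row) storage format, in double precision, with one thread per matrix row. It is a minimal, hypothetical example rather than the authors' implementation; the thread-block size BLOCK stands in for the kind of threading parameter the paper proposes to auto-tune against the problem characteristics and the hardware.

// Minimal sketch (not the authors' code): y = A*x with A stored in CSR,
// double precision, one thread per row. BLOCK is a candidate value for the
// auto-tuned thread-block size discussed in the summary.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void spmv_csr_scalar(int n_rows,
                                const int    *row_ptr,
                                const int    *col_idx,
                                const double *values,
                                const double *x,
                                double       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += values[j] * x[col_idx[j]];
        y[row] = sum;
    }
}

int main()
{
    // 3x3 tridiagonal test matrix in CSR format (hypothetical example data).
    const int n = 3, nnz = 7;
    int    h_row_ptr[n + 1] = {0, 2, 5, 7};
    int    h_col_idx[nnz]   = {0, 1, 0, 1, 2, 1, 2};
    double h_values[nnz]    = {2, -1, -1, 2, -1, -1, 2};
    double h_x[n] = {1, 1, 1}, h_y[n];

    int *d_row_ptr, *d_col_idx; double *d_values, *d_x, *d_y;
    cudaMalloc(&d_row_ptr, (n + 1) * sizeof(int));
    cudaMalloc(&d_col_idx, nnz * sizeof(int));
    cudaMalloc(&d_values,  nnz * sizeof(double));
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMemcpy(d_row_ptr, h_row_ptr, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, h_col_idx, nnz * sizeof(int),     cudaMemcpyHostToDevice);
    cudaMemcpy(d_values,  h_values,  nnz * sizeof(double),  cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

    const int BLOCK = 256;                      // threading parameter to tune
    int grid = (n + BLOCK - 1) / BLOCK;
    spmv_csr_scalar<<<grid, BLOCK>>>(n, d_row_ptr, d_col_idx, d_values, d_x, d_y);
    cudaMemcpy(h_y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("y[%d] = %g\n", i, h_y[i]);
    cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_values);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}

For matrices with very irregular row lengths, alternatives such as a vector (one-warp-per-row) CSR kernel, ELLPACK or hybrid ELL/COO storage are the usual candidates an auto-tuner would choose between, depending on the sparsity pattern and the target GPU.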

MSC:

65F10 Iterative numerical methods for linear systems
65F50 Computational methods for sparse matrices
65Y05 Parallel numerical computation
65Y10 Numerical algorithms for specific classes of architectures
