
Parallel direct methods for solving the system of linear equations with pipelining on a multicore using OpenMP. (English) Zbl 1228.65038

Summary: Recent developments in high performance computer architecture have a significant effect on all fields of scientific computing. Linear algebra and especially the solution of systems of linear equations lie at the heart of many applications in scientific computing.
This paper describes and analyzes three parallel versions of dense direct methods, namely Gaussian elimination and its LU factorization form, for solving linear systems on a multicore processor using the OpenMP interface. More specifically, we present two naive parallel algorithms based on row-block and row-cyclic data distribution, and we place special emphasis on a third parallel algorithm based on the pipeline technique.
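To illustrate the row-cyclic data distribution mentioned above, the following is a minimal sketch (our own code, not the authors'): at each elimination step k, the rows below the pivot are updated in an OpenMP parallel loop, and `schedule(static, 1)` assigns those rows to threads in a cyclic fashion. The function name `gauss_solve` is an assumption for illustration.

```c
/* Gaussian elimination with row-cyclic parallel updates: a sketch.
   Solves an n x n system given as an n x (n+1) augmented matrix,
   stored row-major in a[]; the solution is written to x[].
   No pivoting is done, so a nonzero diagonal is assumed. */
void gauss_solve(int n, double *a, double *x)
{
    int m = n + 1;                        /* row length of augmented matrix */
    for (int k = 0; k < n - 1; k++) {
        /* row-cyclic distribution: chunk size 1 hands rows k+1, k+2, ...
           to threads 0, 1, ... in round-robin order */
        #pragma omp parallel for schedule(static, 1)
        for (int i = k + 1; i < n; i++) {
            double l = a[i*m + k] / a[k*m + k];
            for (int j = k; j < m; j++)
                a[i*m + j] -= l * a[k*m + j];
        }
    }
    /* sequential back substitution */
    for (int i = n - 1; i >= 0; i--) {
        double s = a[i*m + n];
        for (int j = i + 1; j < n; j++)
            s -= a[i*m + j] * x[j];
        x[i] = s / a[i*m + i];
    }
}
```

A row-block distribution would instead use the default `schedule(static)`, giving each thread one contiguous chunk of rows; with chunk size 1 the work per step stays balanced as the active submatrix shrinks. The pragmas are ignored when compiled without OpenMP, so the same code also runs serially.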
Further, we propose an implementation of the pipelining technique in OpenMP. Experimental results on a multicore CPU show that the proposed OpenMP pipeline implementation achieves good overall performance compared with the two naive parallel methods. Finally, we propose a simple, fast and reasonably accurate analytical model to predict the performance of the direct methods with the pipelining technique.
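The pipelining idea can be sketched as follows (our own illustrative code, not the authors' implementation): rows are owned cyclically by threads, and a thread may begin its updates for elimination step k as soon as pivot row k has received all earlier updates, signalled through a `ready[]` flag array with `flush` for visibility. All names here are assumptions for illustration.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial fallbacks so the sketch also compiles without OpenMP */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Pipelined forward elimination on an n x (n+1) augmented matrix a[]
   (row-major). ready[] must hold n zero-initialized flags; ready[k] is
   set once pivot row k has received all updates from earlier steps, so
   later steps overlap with earlier ones instead of waiting at a barrier. */
void gauss_pipeline(int n, double *a, volatile int *ready)
{
    int m = n + 1;
    ready[0] = 1;                          /* pivot row 0 needs no updates */
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int T = omp_get_num_threads();
        for (int k = 0; k < n - 1; k++) {
            while (!ready[k]) {            /* wait for the pivot row */
                #pragma omp flush
            }
            for (int i = k + 1; i < n; i++) {
                if (i % T != t) continue;  /* cyclic row ownership */
                double l = a[i*m + k] / a[k*m + k];
                for (int j = k; j < m; j++)
                    a[i*m + j] -= l * a[k*m + j];
                if (i == k + 1) {          /* next pivot row now complete */
                    #pragma omp flush
                    ready[k + 1] = 1;
                    #pragma omp flush
                }
            }
        }
    }
}
```

Because the owner of row k+1 publishes it immediately after applying pivot k, the thread waiting on step k+1 can proceed while other rows of step k are still being updated; this overlap is what the naive barrier-per-step versions lack. Busy-waiting with `flush` is the classic textbook signalling pattern; a production version would use tasks with dependences or atomics instead.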

MSC:

65F05 Direct numerical methods for linear systems and matrix inversion
65Y05 Parallel numerical computation
68W30 Symbolic computation and algebraic computation
65Y20 Complexity and performance of numerical algorithms
