A parallel multithreaded sparse triangular linear system solver. (English) Zbl 1453.65059

Summary: We propose a parallel sparse triangular linear system solver based on the Spike algorithm. Sparse triangular systems are required to be solved in many applications. Often, they are a bottleneck due to their inherently sequential nature. Furthermore, typically many successive systems with the same coefficient matrix and with different right hand side vectors are required to be solved. The proposed solver decouples the problem at the cost of extra arithmetic operations as in the banded case. Compared to the banded case, there are extra savings due to the sparsity of the triangular coefficient matrix. We show the parallel performance of the proposed solver against the state-of-the-art parallel sparse triangular solver in Intel’s Math Kernel Library (MKL) on a multicore architecture. We also show the effect of various sparse matrix reordering schemes. Numerical results show that the proposed solver outperforms MKL’s solver in \(\sim 80\%\) of cases by a factor of 2.47, on average.


65F05 Direct numerical methods for linear systems and matrix inversion
65Y05 Parallel numerical computation
Full Text: DOI


[1] Hysom, D.; Pothen, A., A scalable parallel algorithm for incomplete factor preconditioning, SIAM J. Sci. Comput., 22, 6, 2194-2215 (2001) · Zbl 0986.65048
[2] Hutchinson, S.; Shadid, J.; Tuminaro, R., Aztec User’s Guide. Version 1. Tech. Rep. (1995), Office of Scientific and Technical Information (OSTI)
[3] Eisenstat, S. C.; Gursky, M. C.; Schultz, M. H.; Sherman, A. H., Yale sparse matrix package i: The symmetric codes, Internat. J. Numer. Methods Engrg., 18, 8, 1145-1151 (1982) · Zbl 0492.65012
[4] Li, X. S.; Demmel, J. W., SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Trans. Math. Softw. (TOMS), 29, 2, 110-140 (2003) · Zbl 1068.90591
[5] Falgout, R. D.; Yang, U. M., Hypre: A library of high performance preconditioners, (International Conference on Computational Science (2002), Springer), 632-641 · Zbl 1056.65046
[6] Schenk, O.; Gärtner, K.; Fichtner, W.; Stricker, A., Pardiso: a high-performance serial and parallel sparse linear solver in semiconductor device simulation, Future Gener. Comput. Syst., 18, 1, 69-78 (2001) · Zbl 1032.68172
[7] Balay, S.; Abhyankar, S.; Adams, M.; Brown, J.; Brune, P.; Buschelman, K.; Dalcin, L.; Eijkhout, V.; Gropp, W.; Kaushik, D., Petsc Users Manual Revision 3.8, Tech. Rep. (2017), Argonne National Lab.(ANL): Argonne National Lab.(ANL) Argonne, IL (United States)
[8] Amestoy, P. R.; Duff, I. S.; L’Excellent, J.-Y.; Koster, J., A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl., 23, 1, 15-41 (2001) · Zbl 0992.65018
[9] Davis, T. A.; Duff, I. S., An unsymmetric-pattern multifrontal method for sparse lu factorization, SIAM J. Matrix Anal. Appl., 18, 1, 140-158 (1997) · Zbl 0884.65021
[10] Filippone, S.; Colajanni, M., Psblas: A library for parallel linear algebra computation on sparse matrices, ACM Trans. Math. Softw. (TOMS), 26, 4, 527-550 (2000) · Zbl 1365.65128
[11] Joshi, M.; Karypis, G.; Kumar, V.; Gupta, A.; Gustavson, F., Pspases: An efficient and scalable parallel sparse direct solver, (Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing (1999), Citeseer)
[12] Anderson, E.; Saad, Y., Solving sparse triangular linear systems on parallel computers, Int. J. High Speed Comput., 1, 01, 73-95 (1989) · Zbl 0726.65026
[13] Saltz, J. H., Aggregation methods for solving sparse triangular systems on multiprocessors, SIAM J. Sci. Stat. Comput., 11, 1, 123-144 (1990) · Zbl 0692.65009
[14] Schreiber, R.; Tang, W.-P., Vectorizing the conjugate gradient method, (Proceedings of the Symposium on CYBER 205 Applications (1982))
[15] Naumov, M., Parallel Incomplete-LU and Cholesky Factorization in the Preconditioned Iterative Methods on the GPU, Tech. Rep. (2012), NVIDIA Corp.: NVIDIA Corp. Westford, MA, USA
[16] Picciau, A.; Inggs, G. E.; Wickerson, J.; Kerrigan, E. C.; Constantinides, G. A., Balancing locality and concurrency: solving sparse triangular systems on gpus, (2016 IEEE 23rd International Conference on High-Performance Computing (HiPC) (2016), IEEE), 183-192
[17] Li, R.; Saad, Y., Gpu-accelerated preconditioned iterative linear solvers, J. Supercomput., 63, 2, 443-466 (2013)
[18] Park, J.; Smelyanskiy, M.; Sundaram, N.; Dubey, P., Sparsifying synchronization for high-performance shared-memory sparse triangular solver, (International Supercomputing Conference (2014), Springer), 124-140
[19] Wolf, M. M.; Heroux, M. A.; Boman, E. G., Factors impacting performance of multithreaded sparse triangular solve, (International Conference on High Performance Computing for Computational Science (2010), Springer), 32-44 · Zbl 1323.65146
[20] Rothberg, E.; Gupta, A., Parallel iccg on a hierarchical memory multiprocessor — Addressing the triangular solve bottleneck., Parallel Comput., 18, 7, 719-741 (1992) · Zbl 0792.68053
[21] Hammond, S. W.; Schreiber, R., Efficient iccg on a shared memory multiprocessor, Int. J. High Speed Comput., 4, 01, 1-21 (1992)
[22] Wang, X.; Xue, W.; Liu, W.; Wu, L., Swsptrsv: a fast sparse triangular solve with sparse level tile layout on sunway architectures, (Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2018), ACM), 338-353
[23] Naumov, M.; Castonguay, P.; Cohen, J., Parallel Graph Coloring with Applications to the Incomplete-LU Factorization on the GPU, Tech. Rep. (2015), NVIDIA Corp.: NVIDIA Corp. Westford, MA, USA
[24] Suchoski, B.; Severn, C.; Shantharam, M.; Raghavan, P., Adapting sparse triangular solution to gpus, (2012 41st International Conference on Parallel Processing Workshops (2012), IEEE), 140-148
[25] Iwashita, T.; Nakashima, H.; Takahashi, Y., Algebraic block multi-color ordering method for parallel multi-threaded sparse triangular solver in iccg method, (Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International (2012), IEEE), 474-483
[26] Ma, S.; Saad, Y., Distributed ILU(0) and SOR Preconditioners for Unstructured Sparse Linear Systems, Tech. Rep. (1994), Army High Performance Computing Research Center
[27] Koester, D. P.; Ranka, S.; Fox, G. C., A parallel gauss-seidel algorithm for sparse power system matrices, (Proceedings of the 1994 ACM/IEEE Conference on Supercomputing (1994), IEEE Computer Society Press), 184-193
[28] Liu, W.; Li, A.; Hogg, J.; Duff, I. S.; Vinter, B., A synchronization-free algorithm for parallel sparse triangular solves, (European Conference on Parallel Processing (2016), Springer), 617-630
[29] Vuduc, R.; Kamil, S.; Hsu, J.; Nishtala, R.; Demmel, J. W.; Yelick, K. A., Automatic performance tuning and analysis of sparse triangular solve, (ICS 2002: Workshop on Performance Optimization Via High-Level Languages and Libraries (2002))
[30] Chow, E.; Anzt, H.; Scott, J.; Dongarra, J., Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning, J. Parallel Distrib. Comput., 119, 219-230 (2018), URL http://www.sciencedirect.com/science/article/pii/S0743731518303034
[31] Totoni, E.; Heath, M. T.; Kale, L. V., Structure-adaptive parallel solution of sparse triangular linear systems, Parallel Comput., 40, 9, 454-470 (2014)
[32] Mayer, J., Parallel algorithms for solving linear systems with sparse triangular matrices, Computing, 86, 4, 291 (2009) · Zbl 1179.65038
[33] Li, X. S., Evaluation of sparse lu factorization and triangular solution on multicore platforms, (International Conference on High Performance Computing for Computational Science (2008), Springer), 287-300
[34] Pothen, A.; Alvarado, F. L., A fast reordering algorithm for parallel sparse triangular solution, SIAM J. Sci. Statist. Comput., 13, 2, 645-653 (1992) · Zbl 0744.65024
[35] Smith, B.; Zhang, H., Sparse triangular solves for ilu revisited: data layout crucial to better performance, Int. J. High Perform. Comput. Appl., 25, 4, 386-391 (2011)
[36] Teranishi, K.; Raghavan, P.; Ng, E., A new data-mapping scheme for latency-tolerant distributed sparse triangular solution, (Supercomputing, ACM/IEEE 2002 Conference (2002), IEEE), 27
[37] Sameh, A. H.; Brent, R. P., Solving triangular systems on a parallel computer, SIAM J. Numer. Anal., 14, 6, 1101-1113 (1977) · Zbl 0375.65016
[38] Chen, S.-C.; Kuck, D. J.; Sameh, A. H., Practical parallel band triangular system solvers, ACM Trans. Math. Softw. (TOMS), 4, 3, 270-277 (1978) · Zbl 0384.65013
[39] Dongarra, J. J.; Sameh, A. H., On some parallel banded system solvers, Parallel Comput., 1, 3-4, 223-235 (1984) · Zbl 0572.65015
[40] Polizzi, E.; Sameh, A. H., A parallel hybrid banded system solver: the SPIKE algorithm, Parallel Comput., 32, 2, 177-194 (2006)
[41] Polizzi, E.; Sameh, A., Spike: A parallel environment for solving banded linear systems, Comput. & Fluids, 36, 1, 113-120 (2007) · Zbl 1181.76110
[42] Manguoglu, M.; Sameh, A. H.; Schenk, O., Pspike: A parallel hybrid sparse linear system solver, (European Conference on Parallel Processing (2009), Springer), 797-808
[43] Schenk, O.; Manguoglu, M.; Sameh, A.; Christen, M.; Sathe, M., Parallel scalable pde-constrained optimization: antenna identification in hyperthermia cancer treatment planning, Comput. Sci.-Res. Dev., 23, 3-4, 177-183 (2009)
[44] Manguoglu, M., A domain-decomposing parallel sparse linear system solver, J. Comput. Appl. Math., 236, 3, 319-325 (2011) · Zbl 1228.65051
[45] Manguoglu, M., Parallel solution of sparse linear systems, (High-Performance Scientific Computing (2012), Springer), 171-184
[46] Bolukbasi, E. S.; Manguoglu, M., A multithreaded recursive and nonrecursive parallel sparse direct solver, (Advances in Computational Fluid-Structure Interaction and Flow Simulation (2016), Springer), 283-292 · Zbl 1356.65117
[47] Venetis, I. E.; Kouris, A.; Sobczyk, A.; Gallopoulos, E.; Sameh, A. H., A direct tridiagonal solver based on givens rotations for gpu architectures, Parallel Comput., 49, 101-116 (2015)
[48] Mendiratta, K.; Polizzi, E., A threaded spike algorithm for solving general banded systems, Parallel Comput., 37, 12, 733-741 (2011)
[49] Karypis, G.; Kumar, V., A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput., 20, 1, 359-392 (1998) · Zbl 0915.68129
[50] Karypis, G.; Kumar, V., A parallel algorithm for multilevel graph partitioning and sparse matrix ordering, J. Parallel Distrib. Comput., 48, 1, 71-95 (1998)
[51] Amestoy, P. R.; Davis, T. A.; Duff, I. S., An approximate minimum degree ordering algorithm, SIAM J. Matrix Anal. Appl., 17, 4, 886-905 (1996) · Zbl 0861.65021
[52] George, A., Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10, 2, 345-363 (1973) · Zbl 0259.65087
[53] George, A.; Liu, J. W., Computer Solution of Large Sparse Positive Definite Systems (1981), Prentice Hall Professional Technical Reference · Zbl 0516.65010
[54] Intel Math Kernel Library. Reference Manual, Tech. Rep., Intel Corporation, Santa Clara, USA (2018). URL https://software.intel.com/en-us/mkl.
[55] Davis, T. A.; Hu, Y., The university of florida sparse matrix collection, ACM Trans. Math. Softw. (TOMS), 38, 1, 1 (2011) · Zbl 1365.65123
[56] Dagum, L.; Menon, R., Openmp: an industry standard API for shared-memory programming, IEEE Comput. Sci. Eng., 5, 1, 46-55 (1998)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.