PoLAPACK: Parallel factorization routines with algorithmic blocking. (English) Zbl 1001.68204

Summary: LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms in order to perform matrix-matrix operations, that is, to obtain the highest performance by maximizing the reuse of data in the upper levels of the memory hierarchy, such as cache. Since parallel computers have different ratios of computation to communication performance, the optimal computational block size differs from machine to machine. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. Too small or too large a block size makes it nearly impossible to achieve good performance on a machine. In such a case, obtaining better performance may require a complete redistribution of the data matrix.
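To make the role of the block size concrete, the following is a minimal sketch (not the paper's code) of a right-looking blocked LU factorization without pivoting. The parameter `nb` is the computational block size the summary refers to: it controls how much of the work is cast as a rank-`nb` matrix-matrix update, the operation that exploits cache reuse.

```python
# Minimal sketch (illustrative, not the paper's implementation):
# right-looking blocked LU factorization without pivoting.
import numpy as np

def blocked_lu(A, nb):
    """In-place LU (no pivoting) of square A with block size nb.
    On return, A holds the unit-lower factor L (below the diagonal)
    and the upper factor U (on and above the diagonal)."""
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Unblocked LU of the current column panel A[k:, k:k+b].
        for j in range(k, k + b):
            A[j+1:, j] /= A[j, j]                       # column of L
            A[j+1:, j+1:k+b] -= np.outer(A[j+1:, j], A[j, j+1:k+b])
        if k + b < n:
            # Triangular solve for the U block row: L11 * U12 = A12.
            L11 = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            A[k:k+b, k+b:] = np.linalg.solve(L11, A[k:k+b, k+b:])
            # Rank-b trailing-matrix update: the level-3 (matrix-matrix)
            # operation whose share of the work grows with nb.
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]
    return A
```

The larger `nb` is, the more of the arithmetic falls into the trailing-matrix update; on a distributed-memory machine, however, `nb` also fixes the granularity of communication, which is why the optimum is machine specific.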
We present parallel LU, QR, and Cholesky factorization routines with 'algorithmic blocking' on a two-dimensional block-cyclic data distribution. With algorithmic blocking, near-optimal performance can be obtained irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines.
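For readers unfamiliar with the underlying layout, here is a minimal sketch (function name and signature are ours, not from the paper) of the one-dimensional block-cyclic map that ScaLAPACK-style libraries apply independently along the rows and columns of the process grid; the physical block size `nb` appears as a parameter of the distribution, separate from any algorithmic block size.

```python
# Minimal sketch of a 1-D block-cyclic index map. In a 2-D block-cyclic
# distribution this map is applied independently to row indices (over the
# process-grid rows) and column indices (over the grid columns).
def block_cyclic_owner(g, nb, p):
    """Map global index g, with physical block size nb, onto p processes.
    Returns (owning process, local index on that process)."""
    block = g // nb                     # which block the index falls in
    proc = block % p                    # blocks are dealt out cyclically
    local = (block // p) * nb + g % nb  # position in the process's local array
    return proc, local
```

For example, with `nb = 2` and `p = 3`, global indices 0..5 land on processes 0, 0, 1, 1, 2, 2, and index 6 wraps back to process 0 at local position 2. Because this physical `nb` is fixed at distribution time, an algorithm whose optimal computational block size differs would otherwise need a full redistribution, which is the problem algorithmic blocking avoids.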


68W30 Symbolic computation and algebraic computation
Full Text: DOI

