A heterogeneous parallel LU factorization algorithm based on a basic column block uniform allocation strategy. (English) Zbl 1435.65047

Summary: Most supercomputers are shipped with both a CPU and a GPU. With the powerful parallel computing capability of GPUs, heterogeneous computing architecture produces new challenges for system software development and application design. Because of the significantly different architectures and programming models of CPUs and GPUs, conventional optimization techniques for CPUs may not work well in a heterogeneous multi-CPU and multi-GPU system. We present a heterogeneous parallel LU factorization algorithm for heterogeneous architectures. According to the different performances of the processors in the system, any given matrix is partitioned into different sizes of basic column blocks. Then, a static task allocation strategy is used to distribute the basic column blocks to corresponding processors uniformly. The idle time is minimized by optimized sizes and the number of basic column blocks. Right-looking ahead technology is also used in systems configured with one CPU core to one GPU to decrease the wait time. Experiments are conducted to test the performance of synchronization and load balancing, communication cost, and scalability of the heterogeneous parallel LU factorization in different systems and compare it with the related matrix algebra algorithm on a heterogeneous system configured with multiple GPUs and CPUs.


65F05 Direct numerical methods for linear systems and matrix inversion
65Y05 Parallel numerical computation
68W10 Parallel algorithms in computer science
Full Text: DOI


[1] Kim, N. S.; Chen, D.; Xiong, J.; Hwu, W.-M. W., Heterogeneous computing meets near-memory acceleration and high-level synthesis in the post-moore era, IEEE Micro, 37, 4, 10-18 (2017)
[2] Zhang, L.; Li, K.; Li, C.; Li, K., Bi-objective workflow scheduling of the energy consumption and reliability in heterogeneous computing systems, Information Sciences, 379, 241-256 (2015)
[3] Zahran, M., Heterogeneous computing: Here to stay, Communications of the ACM, 60, 3, 42-45 (2017)
[4] Wang, J.; Wang, T.; Yang, Z.; Mi, N.; Sheng, B., eSplash: Efficient speculation in large scale heterogeneous computing systems, Proceedings of the IEEE 35th International Performance Computing and Communications Conference (IPCCC ’16)
[5] Gregg, C.; Hazelwood, K., Where is the data? Why you cannot debate CPU vs. GPU performance without the answer, Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS ’11)
[6] Lee, V. W.; Hammarlund, P.; Singhal, R.; Dubey, P.; Kim, C.; Chhugani, J.; Deisher, M.; Kim, D.; Nguyen, A. D.; Satish, N.; Smelyanskiy, M.; Chennupaty, S., Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, ACM SIGARCH Computer Architecture News, 38, 3, 451-460 (2010)
[7] Mittal, S.; Vetter, J. S., A survey of methods for analyzing and improving GPU energy efficiency, ACM Computing Surveys, 47, 2, 19:1-19:23 (2015)
[8] Mittal, S.; Vetter, J. S., A survey of CPU-GPU heterogeneous computing techniques, ACM Computing Surveys, 47, 4, 69:1-69:36 (2015)
[9] Rogers, P.; Fellow, A. C., Heterogeneous system architecture overview, Proceedings of the IEEE Hot Chips 25 Symposium (HCS ’13)
[10] Power, J.; Hestness, J.; Orr, M. S.; Hill, M. D.; Wood, D. A., Gem5-gpu: A heterogeneous CPU-GPU simulator, IEEE Computer Architecture Letters, 14, 1, 34-36 (2015)
[11] Kan, G.; Lei, T.; Liang, K.; Li, J.; Ding, L.; He, X.; Yu, H.; Zhang, D.; Zuo, D.; Bao, Z.; Amo-Boateng, M.; Hu, Y.; Zhang, M., A multi-core CPU and many-core GPU based fast parallel shuffled complex evolution global optimization approach, IEEE Transactions on Parallel and Distributed Systems, 28, 2, 332-344 (2017)
[12] Cheng, Z.; Shaffer, E.; Yeh, R.; Zagaris, G.; Olson, L., Efficient parallel optimization of volume meshes on heterogeneous computing systems, Engineering with Computers, 33, 4, 717-726 (2017)
[13] Chen, C.; Fang, J.; Tang, T.; Yang, C., LU factorization on heterogeneous systems: an energy-efficient approach towards high performance, Computing: Archives for Scientific Computing, 99, 8, 791-811 (2017) · Zbl 1372.65091
[14] Grigori, L.; Cayrols, S.; Demmel, J. W., Low rank approximation of a sparse matrix based on {LU} factorization with column and row tournament pivoting, SIAM Journal on Scientific Computing, 40, 2, C181-C209 (2018) · Zbl 1453.65090
[15] Dong, T.; Haidar, A.; Luszczek, P.; Harris, J. A.; Tomov, S.; Dongarra, J., LU factorization of small matrices: Accelerating batched DGETRF on the GPU, Proceedings of the 16th IEEE International Conference on High Performance Computing and Communications (HPCC ’14), 11th IEEE International Conference on Embedded Software and Systems (ICESS ’14) and 6th International Symposium on Cyberspace Safety and Security (CSS ’14)
[16] Buttari, A.; Langou, J.; Kurzak, J.; Dongarra, J., A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing. Systems & Applications, 35, 1, 38-53 (2009)
[17] Anderson, E.; Bai, Z.; Bischof, C.; Blackford, L. S.; Demmel, J.; Dongarra, J.; Du Croz, J.; Greenbaum, A.; Hammarling, S.; McKenney, A.; Sorensen, D., LAPACK Users’ Guide (1999), Society for Industrial and Applied Mathematics · Zbl 0934.65030
[18] Choi, J.; Demmel, J.; Dhillon, I.; Dongarra, J.; Ostrouchov, S.; Petitet, A.; Stanley, K.; Walker, D.; Whaley, R. C., ScaLAPACK: a portable linear algebra library for distributed memory computers—design issues and performance, Computer Physics Communications, 97, 1-2, 1-15 (1996) · Zbl 0926.65148
[19] Tomov, S.; Dongarra, J.; Baboulin, M., Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, 36, 5-6, 232-240 (2010) · Zbl 1204.68268
[20] NVIDIA, DU-06702-001_v9.2 CUBLAS library user guide, Available: https://docs.nvidia.com/cuda/archive/9.2/pdf/CUBLAS_Library.pdf
[21] Humphrey, J. R.; Kelmelis, E. J.; Price, D. K.; Spagnoli, K. E.; Paolini, A. L.; Kelmelis, E. J., CULA: hybrid GPU accelerated linear algebra routines, Proceedings of the SPIE Defense, Security, and Sensing
[22] Volkov, V.; Demmel, J. W.; LU., QR and Cholesky factorizations using vector capabilities of GPUs, Technical Report, UCB/EECS-2008-49 (2008), Berkeley, Calif, USA: EECS Department, University of California, Berkeley, Calif, USA
[23] van de Velde, E. F., Experiments with multicomputer LU‐decomposition, Concurrency: Practice and Experience, 2, 1, 1-26 (1990)
[24] Kurzak, J.; Luszczek, P.; Faverge, M.; Dongarra, J., LU factorization with partial pivoting for a multicore system with accelerators, IEEE Transactions on Parallel and Distributed Systems, 24, 8, 1613-1621 (2013)
[25] Catalán S, S.; Rodríguez-Sánchez, R.; Quintana-Ortí, E. S., Static versus dynamic task scheduling of the lu factorization on arm big. little architectures, Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW ’17)
[26] Deisher, M.; Smelyanskiy, M.; Nickerson, B.; Lee, V. W.; Chuvelev, M.; Dubey, P., Designing and dynamically load balancing hybrid LU for multi/many-core, Computer Science - Research and Development, 26, 3-4, 211-220 (2011)
[27] Endo, T.; Matsuoka, S.; Nukada, A.; Maruyama, N., Linpack evaluation on a supercomputer with heterogeneous accelerators, Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (IPDPS ’10)
[28] Choi, J.; Dongarra, J. J.; Ostrouchov, L. S.; Petitet, A. P.; Walker, D. W.; Whaley, R. C., Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines, Scientific Programming, 5, 3, 173-184 (1996)
[29] Wang, F.; Yang, C.-Q.; Du, Y.-F.; Chen, J.; Yi, H.-Z.; Xu, W.-X., Optimizing linpack benchmark on GPU-accelerated petascale supercomputer, Journal of Computer Science and Technology, 26, 5, 854-865 (2011)
[30] Lastovetsky, A.; Reddy, R., Data distribution for dense factorization on computers with memory heterogeneity, Parallel Computing. Systems & Applications, 33, 12, 757-779 (2007)
[31] Wu, R., A heterogeneous parallel cholesky block factorization algorithm, IEEE Access, 6, 14071-14077 (2018)
[32] Dong, F.; Tomov, S.; Dongarra, J., Efficient support for matrix computations on heterogeneous multi-core and multi-GPU architectures, Technical Report, UT-CS-11-668 (2011), Knoxville, TN, USA: University of Tennessee Knoxville, Knoxville, TN, USA
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.