
A novel CSR-based sparse matrix-vector multiplication on GPUs. (English) Zbl 1400.65071

Summary: Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computing. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV kernels on graphics processing units (GPUs), for example, CSR-scalar and CSR-vector, usually perform poorly due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR comprises two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines of the vendor-tuned CUSPARSE library, and is comparable with the recently proposed CSR-based algorithm CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that PCSR achieves good performance and high parallel efficiency on multiple GPUs, whether or not inter-GPU communication is taken into account.

MSC:

65Y10 Numerical algorithms for specific classes of architectures
65F50 Computational methods for sparse matrices

References:

[1] Saad, Y., Iterative Methods for Sparse Linear Systems, (2003), Philadelphia, Pa, USA: SIAM, Philadelphia, Pa, USA · Zbl 1002.65042 · doi:10.1137/1.9780898718003
[2] Bell, N.; Garland, M., Efficient Sparse Matrix-vector Multiplication on CUDA, (2008), NVIDIA
[3] Bell, N.; Garland, M., Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09) · doi:10.1145/1654059.1654078
[4] NVIDIA
[5] Bell, N.; Garland, M., Cusp: Generic parallel algorithms for sparse matrix and graph computations, version 0.5.1
[6] Lu, F.; Song, J.; Yin, F.; Zhu, X., Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters, Computer Physics Communications, 183, 6, 1172-1181, (2012) · doi:10.1016/j.cpc.2012.01.019
[7] Dehnavi, M. M.; Fernández, D. M.; Giannacopoulos, D., Finite-element sparse matrix vector multiplication on graphic processing units, IEEE Transactions on Magnetics, 46, 8, 2982-2985, (2010) · doi:10.1109/TMAG.2010.2043511
[8] Dehnavi, M. M.; Fernández, D. M.; Giannacopoulos, D., Enhancing the performance of conjugate gradient solvers on graphic processing units, IEEE Transactions on Magnetics, 47, 5, 1162-1165, (2011) · doi:10.1109/TMAG.2010.2081662
[9] Greathouse, J. L.; Daga, M., Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’14) · doi:10.1109/sc.2014.68
[10] Karakasis, V.; Gkountouvas, T.; Kourtis, K.; Goumas, G.; Koziris, N., An extended compression format for the optimization of sparse matrix-vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 24, 10, 1930-1940, (2013) · doi:10.1109/TPDS.2012.290
[11] Tang, W. T.; Tan, W. J.; Ray, R.; Wong, Y. W.; Chen, W.; Kuo, S.; Goh, R. S.; Turner, S. J.; Wong, W., Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’13) · doi:10.1145/2503210.2503234
[12] Verschoor, M.; Jalba, A. C., Analysis and performance estimation of the conjugate gradient method on multiple GPUs, Parallel Computing, 38, 10-11, 552-575, (2012) · doi:10.1016/j.parco.2012.07.002
[13] Choi, J. W.; Singh, A.; Vuduc, R. W., Model-driven autotuning of sparse matrix-vector multiply on GPUs, Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’10), ACM · doi:10.1145/1693453.1693471
[14] Oyarzun, G.; Borrell, R.; Gorobets, A.; Oliva, A., MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner, Computers & Fluids, 92, 244-252, (2014) · Zbl 1390.65038 · doi:10.1016/j.compfluid.2013.10.035
[15] Vázquez, F.; Fernández, J. J.; Garzón, E. M., A new approach for sparse matrix vector product on NVIDIA GPUs, Concurrency Computation Practice and Experience, 23, 8, 815-826, (2011) · doi:10.1002/cpe.1658
[16] Vázquez, F.; Fernández, J. J.; Garzón, E. M., Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach, Parallel Computing, 38, 8, 408-420, (2012) · doi:10.1016/j.parco.2011.08.003
[17] Monakov, A.; Lokhmotov, A.; Avetisyan, A., Automatically tuning sparse matrix-vector multiplication for GPU architectures, High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010. Proceedings, 111-125, (2010), Berlin, Germany: Springer, Berlin, Germany · doi:10.1007/978-3-642-11515-8_10
[18] Kreutzer, M.; Hager, G.; Wellein, G.; Fehske, H.; Bishop, A. R., A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units, SIAM Journal on Scientific Computing, 36, 5, C401-C423, (2014) · Zbl 1307.65055 · doi:10.1137/130930352
[19] Dang, H.-V.; Schmidt, B., CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations, Parallel Computing. Systems & Applications, 39, 11, 737-750, (2013) · doi:10.1016/j.parco.2013.09.005
[20] Yan, S.; Li, C.; Zhang, Y.; Zhou, H., YaSpMV: yet another SpMV framework on GPUs, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’14) · doi:10.1145/2555243.2555255
[21] Kincaid, D. R.; Young, D. M., A brief review of the ITPACK project, Journal of Computational and Applied Mathematics, 24, 1-2, 121-127, (1988) · Zbl 0657.65048 · doi:10.1016/0377-0427(88)90347-0
[22] Blelloch, G.; Heroux, M.; Zagha, M., Segmented operations for sparse matrix computation on vector multiprocessors, (1993), Pittsburgh, Pa, USA: School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA
[23] Nickolls, J.; Buck, I.; Garland, M.; Skadron, K., Scalable parallel programming with CUDA, ACM Queue, 6, 2, 40-53, (2008) · doi:10.1145/1365490.1365500
[24] Jang, B.; Schaa, D.; Mistry, P.; Kaeli, D., Exploiting memory access patterns to improve memory performance in data-parallel architectures, IEEE Transactions on Parallel and Distributed Systems, 22, 1, 105-118, (2011) · doi:10.1109/tpds.2010.107
[25] Davis, T. A.; Hu, Y., The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, 38, 1, 1-25, (2011) · Zbl 1365.65123 · doi:10.1145/2049662.2049663
[26] NVIDIA, CUDA C Programming Guide 6.5