## Efficient CSR-based sparse matrix-vector multiplication on GPU.(English)Zbl 1400.65070

Summary: Sparse matrix-vector multiplication (SpMV) is an important operation in computational science and needs be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) on the GPU. Our method dynamically assigns different numbers of rows to each thread block and executes different optimization implementations on the basis of the number of rows it involves for each block. The process of accesses to the CSR arrays is fully coalesced, and the GPU’s DRAM bandwidth is efficiently utilized by loading data into the shared memory, which alleviates the bottleneck of many existing CSR-based algorithms (i.e., CSR-scalar and CSR-vector). Test results on C2050 and K20c GPUs show that our method outperforms a perfect-CSR algorithm that inspires our work, the vendor tuned CUSPARSE V6.5 and CUSP V0.5.1, and three popular algorithms clSpMV, CSR5, and CSR-Adaptive.

### MSC:

 65Y10 Numerical algorithms for specific classes of architectures 65F50 Computational methods for sparse matrices

### Software:

LightSpMV; SELL_C_sigma; CUDA; clSpMV; CUSPARSE; yaSpMV; CSR5; CUSP; SparseMatrix
Full Text:

### References:

 [1] NVIDIA, CUDA C Programming Guide 6.5 [2] Saad, Y., Iterative Methods for Sparse Linear Systems, (2003), Philadelphia, Pa, USA: Society for Industrial and Applied Mathematics, Philadelphia, Pa, USA · Zbl 1002.65042 [3] Gao, J.; Li, Z.; Liang, R.; He, G., Adaptive optimization $$l_1$$-minimization solvers on GPU, International Journal of Parallel Programming, (2016) [4] Li, J.; Li, X.; Yang, B.; Sun, X., Segmentation-based image copy-move forgery detection scheme, IEEE Transactions on Information Forensics and Security, 10, 3, 507-518, (2015) [5] Xia, Z.; Wang, X.; Sun, X.; Wang, Q., A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data, IEEE Transactions on Parallel and Distributed Systems, 27, 2, 340-352, (2016) [6] Nickolls, J.; Buck, I.; Garland, M.; Skadron, K., Scalable parallel programming with CUDA, Queue, 6, 2, 40-53, (2008) [7] Bell, N.; Garland, M., Efficient sparse matrix-vector multiplication on CUDA, (2008), NVIDIA [8] Bell, N.; Garland, M., Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09), ACM [9] Baskaran, M. M.; Bordawekar, R., Optimizing Sparse Matrixvector Multiplication on GPUs, (2009), Yorktown Heights, NY, USA: IBM Research, Yorktown Heights, NY, USA [10] Dehnavi, M. M.; Fernández, D. M.; Giannacopoulos, D., Finite-element sparse matrix vector multiplication on graphic processing units, IEEE Transactions on Magnetics, 46, 8, 2982-2985, (2010) [11] Dehnavi, M. M.; Fernández, D. M.; Giannacopoulos, D., Enhancing the performance of conjugate gradient solvers on graphic processing units, IEEE Transactions on Magnetics, 47, 5, 1162-1165, (2011) [12] Reguly, I.; Giles, M., Efficient sparse matrix-vector multiplication on cache-based GPUs, Proceedings of the IEEE Innovative Parallel Computing [13] Gao, J.; Liang, R.; Wang, J., Research on the conjugate gradient algorithm with a modified incomplete Cholesky preconditioner on GPU, Journal of Parallel and Distributed Computing, 74, 2, 2088-2098, (2014) [14] Greathouse, J. L.; Daga, M., Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’14), ACM [15] Koza, Z.; Matyka, M.; Szkoda, S., Compressed multirow storage format for sparse matrices on graphics processing units, SIAM Journal on Scientific Computing, 36, 2, C219-C239, (2014) · Zbl 1296.65069 [16] Liu, Y.; Schmidt, B., LightSpMV: faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs, Proceedings of the IEEE 26th International Conference on Application-Specific Systems, Architectures and Processors (ASAP ’15), IEEE [17] Daga, M.; Greathouse, J. L., Structural agnostic SpMV: adapting CSR-adaptive for irregular matrices, Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC ’15) [18] Liu, W.; Vinter, B., CSR5: an efficient storage format for crossplatform sparse matrix-vector multiplication, Proceedings of the 29th ACM International Conference on Supercomputing (ICS ’15), ACM [19] He, G.; Gao, J., A novel CSR-based sparse matrix-vector multiplication on GPUs, Mathematical Problems in Engineering, 2016, (2016) [20] NVIDIA, CUDA C Best Practices Guide 6.5, (2014) [21] NVIDIA, CUSPARSE Library 6.5, (2014) [22] Bell, N.; Garland, M., Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations, Version 0.5.1, (2015) [23] Su, B.-Y.; Keutzer, K., clSpMV: a cross-platform OpenCL SpMV framework on GPUs, Proceedings of the 26th ACM International Conference on Supercomputing (ICS ’12) [24] Karakasis, V.; Gkountouvas, T.; Kourtis, K.; Goumas, G.; Koziris, N., An extended compression format for the optimization of sparse matrix-vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 24, 10, 1930-1940, (2013) [25] Tang, W. T.; Tan, W. J.; Ray, R., Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13), IEEE [26] Verschoor, M.; Jalba, A. C., Analysis and performance estimation of the conjugate gradient method on multiple GPUs, Parallel Computing, 38, 10-11, 552-575, (2012) [27] Vázquez, F.; Fernández, J. J.; Garzõn, E. M., A new approach for sparse matrix vector product on NVIDIA GPUs, Concurrency and Computation: Practice and Experience, 23, 8, 815-826, (2011) [28] Choi, J. W.; Singh, A.; Vuduc, R. W., Model-driven autotuning of sparse matrix-vector multiply on GPUs, Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’10), ACM [29] Monakov, A.; Lokhmotov, A.; Avetisyan, A., Automatically tuning sparse matrix-vector multiplication for GPU architectures, High Performance Embedded Architectures and Compilers. High Performance Embedded Architectures and Compilers, Lecture Notes in Computer Science, 5952, 111-125, (2010), Berlin, Germany: Springer, Berlin, Germany [30] Kreutzer, M.; Hager, G.; Wellein, G.; Fehske, H.; Bishop, A. R., A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units, SIAM Journal on Scientific Computing, 36, 5, C401-C423, (2014) · Zbl 1307.65055 [31] Dang, H.-V.; Schmidt, B., CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations, Parallel Computing. Systems & Applications, 39, 11, 737-750, (2013) [32] Yan, S.; Li, C.; Zhang, Y.; Zhou, H., YaSpMV: yet another SpMV framework on GPUs, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’14), ACM [33] Lu, F.; Song, J.; Yin, F.; Zhu, X., Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters, Computer Physics Communications, 183, 6, 1172-1181, (2012) [34] Jang, B.; Schaa, D.; Mistry, P.; Kaeli, D., Exploiting memory access patterns to improve memory performance in data-parallel architectures, IEEE Transactions on Parallel and Distributed Systems, 22, 1, 105-118, (2011) [35] Williams, S.; Oliker, L.; Vuduc, R.; Shalf, J.; Yelick, K.; Demmel, J., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, Parallel Computing, 35, 3, 178-194, (2009) [36] Davis, T. A.; Hu, Y., The university of florida sparse matrix collection, ACM Transactions on Mathematical Software, 38, 1, 1-25, (2011) · Zbl 1365.65123 [37] Gao, J.; Wang, Y.; Wang, J., A novel multi-graphics processing unit parallel optimization framework for the sparse matrix-vector multiplication, Concurrency and Computation: Practice and Experience, (2016)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.