×

On soft errors in the conjugate gradient method: sensitivity and robust numerical detection. (English) Zbl 1458.65027

Summary: The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than 60 years old, it is still a serious candidate for extreme-scale computations on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices affect dramatically their sensitivity to natural radiation and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset which consists in a bit-flip in a memory cell producing unexpected results at the application level. Consequently, future extreme-scale computing facilities will be more prone to errors of any kind, including bit-flips, during their calculations. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such soft errors and assess their robustness through extensive numerical experiments.

MSC:

65F10 Iterative numerical methods for linear systems

Software:

PyAMG; mctoolbox; PVM
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] E. Agullo, S. Cools, E. Fatih-Yetkin, L. Giraud, N. Schenkel, and W. Vanroose, On Soft Errors in the Conjugate Gradient Method: Sensitivity and Robust Numerical Detection-Revised, Research Report RR-9330, Inria Bordeaux Sud-Ouest, 2020.
[2] E. Agullo, S. Cools, E. Fatih-Yetkin, L. Giraud, N. Schenkels, and W. Vanroose, A Complementary Note on Soft Errors in the Conjugate Gradient Method: The Persistent Error Case, Research Report RR-9360, Inria Bordeaux Sud-Ouest, August 2020.
[3] E. Agullo, L. Giraud, and L. Poirel, Robust preconditioners via generalized eigenproblems for hybrid sparse linear solvers, SIAM J. Matrix Anal. Appl., 40 (2019), pp. 417-439. · Zbl 1411.65042
[4] W. Bland, A. Bouteiller, T. Hérault, G. Bosilca, and J. J. Dongarra, Post-failure recovery of MPI communication capability: Design and rationale, IJHPCA, 27 (2013), pp. 244-254.
[5] A. Bouras and V. Frayssé, Inexact matrix-vector products in Krylov methods for solving linear systems: A relaxation strategy, SIAM J. Matrix Anal. Appl., 26 (2005), pp. 660-678. · Zbl 1075.65041
[6] A. Bouras, V. Frayssé, and L. Giraud, A Relaxation Strategy for Inner-Outer Linear Solvers in Domain Decomposition Methods, Technical Report TR/PA/00/17, CERFACS, Toulouse, France, 2000.
[7] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial, SIAM, Philadelphia, 2000. · Zbl 0958.65128
[8] J. Elliott, F. Mueller, M. Stoyanov, and C. Webster, Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic, Technical Report 2013-2, 2013.
[9] J. Elliott, M. Hoemmen, and F. Mueller, Evaluating the impact of SDC on the GMRES iterative solver, in proceedings of the 28th International, Parallel and Distributed Processing Symposium, IEEE, 2014, pp. 1193-1202.
[10] J. Elliott, M. Hoemmen, and F. Mueller, Exploiting data representation for fault tolerance, J. Comput. Sci., 14 (2016), pp. 51-60.
[11] MPI: A Message Passing Interface Standard Version 3.1, The MPI Forum, 2015.
[12] M. L. Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, Mixed-precision in-memory computing, Nature Electronics, 1 (2018), pp. 246-253.
[13] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam, PVM: Parallel Virtual Machine: A Users’ Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, 1994. · Zbl 0849.68032
[14] P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned conjugate gradient algorithm, Parallel Comput., 40 (2014), pp. 224-238.
[15] A. Greenbaum, Estimating the attainable accuracy of recursively computed residual methods, SIAM J. Matrix Anal. Appl., 18 (1997), pp. 535-551. · Zbl 0873.65027
[16] M. H. Gutknecht and Z. Strakoš, Accuracy of two three-term and three two-term recurrences for Krylov space solvers, SIAM J. Matrix Anal. Appl., 22 (2000), pp. 213-229. · Zbl 0976.65030
[17] N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), pp. 217-288. · Zbl 1269.65043
[18] T. Herault and Y. Robert, eds., Fault-Tolerance Techniques for High-Performance Computing, Springer, New York, 2015. · Zbl 1330.68026
[19] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. National Bureau of Standards, 46 (1952), pp. 409-436. · Zbl 0048.09901
[20] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, Philadelphia, 2002. · Zbl 1011.65010
[21] M. Hoemmen and M. A. Heroux, Fault-tolerant iterative methods via selective reliability, in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage, and Analysis, 2011.
[22] K.-H. Huang and J. a Abraham, Algorithm-based fault tolerance for matnx operations, IEEE Trans. Computers, C-33 (1984), pp. 518-528. · Zbl 0557.68027
[23] J. Liesen and Z. Strakoš, Krylov Subspace Methods, Numer. Math. Sci. Comput., Oxford University Press, New York, 2013. · Zbl 1263.65034
[24] G. Meurant and Z. Strakoš, The Lanczos and conjugate gradient algorithms in finite precision arithmetic, Acta Numer., 15 (2006), pp. 471-542. · Zbl 1113.65032
[25] B. O. Mutlu, G. Kestor, J. Manzano, O. Unsal, S. Chatterjee, and S. Krishnamoorthy, Characterization of the impact of soft errors on iterative methods, in Proceedings of the 25th International Conference on High Performance Computing, IEEE, 2018, pp. 203-214.
[26] L. N. Olson and J. B. Schroder, PyAMG: Algebraic Multigrid Solvers in Python v4.0, Release 4.0, 2018.
[27] B. Parhami, Defect, fault, error,..., or failure?, IEEE Trans. Reliability, 46 (1997), pp. 450-451.
[28] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, Philadelphia, 2003. · Zbl 1031.65046
[29] P. Sao and R. Vuduc, Self-stabilizing iterative solvers, in Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2013, pp. 1-8.
[30] M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Characterizing the impact of soft errors on iterative methods in scientific computing, in Proceedings of the International Conference on Supercomputing, 2011, p. 152.
[31] V. Simoncini and D. B. Szyld, Theory of inexact Krylov subspace methods and applications to scientific computing, SIAM J. Sci. Comput., 25 (2003), pp. 454-477. · Zbl 1048.65032
[32] Z. Strakoš and P. Tichy, Error estimation in preconditioned conjugate gradients, BIT, 45 (2005), pp. 789-817. · Zbl 1095.65029
[33] H. A. Van der Vorst, Iterative Krylov Methods for Large Linear Systems. Cambridge Monogr. Appl. Comput. Math. 13, Cambridge University Press, Cambridge, UK, 2003. · Zbl 1023.65027
[34] H. A. van der Vorst and Q. Ye, Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals, SIAM J. Sci. Comput., 22 (2000), pp. 835-852. · Zbl 0983.65039
[35] R. Velazco, P. Fouillat, and R. Reis, Radiation Effects on Embedded Systems. Springer, New York, 2007.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.