
Is the multigrid method fault tolerant? The multilevel case. (English) Zbl 1377.65041

Summary: Computing at the exascale level is expected to be affected by a significantly higher rate of faults, due to increased component counts as well as power considerations. Therefore, current-day numerical algorithms need to be re-examined to determine whether they are fault-resilient and which critical operations need to be safeguarded in order to obtain performance close to that of the ideal fault-free method. In a previous paper, a framework for the analysis of random stationary linear iterations was presented and applied to the two-grid method. The present work is concerned with the multigrid algorithm for the solution of linear systems of equations, which is widely used on high-performance computing systems. It is shown that the Fault-Prone Multigrid Method is not resilient unless the prolongation operation is protected. Strategies for fault detection and mitigation, as well as for protection of the prolongation operation, are presented and tested, and a guideline for an optimal choice of parameters is devised.
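To make the setting concrete, the following is a minimal sketch of a multigrid V-cycle with injected faults, not the authors' implementation. Everything in it is an assumption made for illustration: the 1D Poisson model problem, the damped Jacobi smoother, the fault model (each entry of an intermediate result is zeroed independently with probability `q`), the places where faults are injected, and all names (`apply_faults`, `v_cycle`, `run`, `protect_prolongation`).

```python
import random

def apply_faults(v, q, rng):
    # Assumed fault model: each entry is lost (set to zero) with probability q.
    return [0.0 if rng.random() < q else x for x in v]

def residual(u, f, h):
    # r = f - A u with A = tridiag(-1, 2, -1) / h^2 and zero Dirichlet boundaries.
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r

def smooth(u, f, h, q, rng, omega=2.0 / 3.0):
    # One damped Jacobi sweep; the computed update is treated as fault-prone.
    update = [omega * h * h / 2.0 * ri for ri in residual(u, f, h)]
    update = apply_faults(update, q, rng)
    return [ui + di for ui, di in zip(u, update)]

def restrict(r):
    # Full weighting from n = 2m + 1 fine points to m coarse points.
    m = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(m)]

def prolong(ec, n):
    # Linear interpolation from m coarse points to n = 2m + 1 fine points.
    m = len(ec)
    e = [0.0] * n
    for i in range(n):
        j = i // 2
        if i % 2 == 1:
            e[i] = ec[j]
        else:
            lo = ec[j - 1] if j > 0 else 0.0
            hi = ec[j] if j < m else 0.0
            e[i] = 0.5 * (lo + hi)
    return e

def v_cycle(u, f, h, q, rng, protect_prolongation):
    n = len(u)
    if n == 1:
        return [f[0] * h * h / 2.0]  # exact solve on the coarsest grid
    u = smooth(u, f, h, q, rng)
    rc = apply_faults(restrict(residual(u, f, h)), q, rng)
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, q, rng, protect_prolongation)
    e = prolong(ec, n)
    if not protect_prolongation:
        e = apply_faults(e, q, rng)  # unprotected prolongation can lose entries
    u = [ui + ei for ui, ei in zip(u, e)]
    return smooth(u, f, h, q, rng)

def run(q, protect_prolongation, levels=6, cycles=10, seed=0):
    # Returns the max-norm residual after `cycles` V-cycles, relative to the start.
    rng = random.Random(seed)
    n = 2 ** levels - 1
    h = 1.0 / (n + 1)
    f = [1.0] * n
    u = [0.0] * n
    r0 = max(abs(x) for x in residual(u, f, h))
    for _ in range(cycles):
        u = v_cycle(u, f, h, q, rng, protect_prolongation)
    return max(abs(x) for x in residual(u, f, h)) / r0
```

Calling `run(q, False)` and `run(q, True)` for a small fault probability `q` gives a rough way to compare the unprotected and protected variants on this toy problem; the outcome depends on the seed and on the assumed fault model, so it should not be read as a reproduction of the paper's results.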

MSC:

65F10 Iterative numerical methods for linear systems
65N22 Numerical solution of discretized equations for boundary value problems involving PDEs
65N55 Multigrid methods; domain decomposition for boundary value problems involving PDEs
