Implementation of fault-tolerant GridRPC applications. (English) Zbl 1099.68530

J. Grid Comput. 4, No. 2, 145-157 (2006).
Summary: A task-parallel application is implemented with Ninf-G, a GridRPC system, and a series of experiments is conducted over three months on the Grid testbed in the Asia-Pacific region. Across tens of long-running executions, typical fault patterns were collected, and unstable network throughput was identified as a major cause of the faults. Several points are stressed for keeping fault-recovery operations from reducing task throughput: minimizing the timeout used for fault detection, performing recovery in the background, assigning tasks redundantly, and so on. The study also offers guidance for designing an automated fault-tolerance mechanism in a layer above the GridRPC framework.
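The three measures named in the summary (a short fault-detection timeout, background recovery, and redundant task assignment) can be illustrated with a minimal sketch. It is not the paper's implementation and does not use the Ninf-G or GridRPC API: `Worker`, `run_tasks`, the `stall_on` parameter, the timing values, and the squaring "computation" are all hypothetical, and remote servers are simulated with local threads.

```python
import concurrent.futures as cf
import threading
import time


class Worker:
    """Hypothetical stand-in for a remote server handle (not a Ninf-G API)."""

    def __init__(self, name, stall_on=()):
        self.name = name
        self.stall_on = set(stall_on)  # task ids on which this worker stalls
        self.healthy = True

    def call(self, task_id):
        if task_id in self.stall_on:
            time.sleep(0.6)  # simulate a stalled network transfer
        return task_id * task_id  # placeholder computation

    def recover(self):
        time.sleep(0.05)  # simulate re-creating the remote handle
        self.stall_on.clear()
        self.healthy = True


def run_tasks(task_ids, workers, timeout=0.2, copies=2):
    """Run each task on up to `copies` healthy workers; return (results, faults)."""
    results, faults, recoveries = {}, [], []
    pool = cf.ThreadPoolExecutor(max_workers=4 * len(workers))
    queue = list(task_ids)
    while queue:
        task_id = queue.pop(0)
        healthy = [w for w in workers if w.healthy]
        if not healthy:
            # No usable worker: block until the background recoveries finish.
            for t in recoveries:
                t.join()
            healthy = [w for w in workers if w.healthy]
        # Redundant assignment: the same task goes to several workers.
        futures = {pool.submit(w.call, task_id): w for w in healthy[:copies]}
        # Fault detection: a short timeout bounds how long a stall costs.
        done, pending = cf.wait(futures, timeout=timeout)
        for f in pending:
            w = futures[f]
            w.healthy = False
            faults.append((task_id, w.name))
            # Background recovery: the main loop does not wait for this thread.
            t = threading.Thread(target=w.recover)
            t.start()
            recoveries.append(t)
        if done:
            results[task_id] = next(iter(done)).result()
        else:
            queue.append(task_id)  # every copy stalled: retry later
    for t in recoveries:
        t.join()
    pool.shutdown(wait=True)
    return results, faults
```

For example, with three simulated workers of which one stalls on task 2, `run_tasks([1, 2, 3, 4], workers)` still returns a result for every task, because the redundant copy on a second worker completes while the stalled worker is flagged and recovered in the background.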

MSC:

68M15 Reliability, testing and fault tolerance of networks and computer systems
68M10 Network design and communication in computer systems

Keywords:

GridRPC system
