Fault-tolerance mechanisms for a parallel programming system – a responsiveness perspective. (English) Zbl 0969.68507
Hommel, Günter (ed.), Communication-based systems. Proceedings of the 3rd international workshop, TU Berlin, Germany, March 31 - April 1, 2000. Dordrecht: Kluwer Academic Publishers. 43-54 (2000).
Summary: Clusters of workstations are an attractive environment for high performance computing. For some applications, however, clusters still lack certain properties. One such property is responsive (dependable and timely) execution of programs. This paper studies two mechanisms (checkpointing and replication) to improve the responsiveness (the probability of meeting a deadline in the presence of faults) of a parallel programming system, Calypso, by ameliorating a single point of failure of Calypso. Experiments show that checkpointing is a suitable tool for achieving high responsiveness and that even a very modest degree of replication suffices to improve responsiveness.
For the entire collection see [Zbl 0934.00022].
68N19 Other programming paradigms (object-oriented, sequential, concurrent, automatic, etc.)