Detecting causal relationships in distributed computations: In search of the holy grail. (English) Zbl 0813.68096

Summary: The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The issue of observing distributed computations is a causally consistent way and the basic problems of detecting global predicates are discussed. To illustrate the major difficulties, some typical monitoring and debugging approaches are assessed,and it is demonstrated how their feasibility is severely limited by the fundamental problem to master the complexity of causal relationships.


68Q10 Modes of computation (nondeterministic, parallel, interactive, probabilistic, etc.)


Algorithm 97
Full Text: DOI


[1] Acharya A, Badrinath BR: Recording distributed snapshots based on causal order of message delivery. Inf Process Lett 44: 317–321 (1992) · Zbl 0795.68008
[2] Ahmad M, Hutto PW, John R: Implementing and programming causal distributed shared memory. Proc 11th Int Conference on Distributed Computing Systems, Arlington, Texas, pp 274–281, 1991
[3] Ahuja M, Carlson T, Gahlot A, Shands D: Timestamping events for inferring ’affects’ relation and potential causality. Proc 15th IEEE Int Computer Software and Application Conference COMPSAC 91, Tokyo, pp 606–611, 1991
[4] Baldy P, Dicky H, Medina R, Morvan M, Vilarem JF: Efficient reconstruction of the causal relationship in distributed computations. Technical Report 92-013, Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier, Montpellier 1992
[5] Bates PC, Wileden JC: High-level debugging of distributed systems: the behavioral abstraction approach. J Syst Softw 4(3): 255–264 (1983) · Zbl 05432178
[6] Birman K, Joseph T: Exploiting virtual synchrony in distributed systems. Operating Syst Rev 22(1): 123–138 (1987)
[7] Birman K: The process group approach to reliable distributed computing. Technical Report, Computer Science Department, Cornell University, Ithaca, New York 1991
[8] Birman K, Schiper A, Stephenson P: Lightweight causal and atomic group multicast. ACM Trans Comput Syst 9(3): 272–314 (1991)
[9] Bruegge B, Hibbard P: Generalized Path Expressions. J Syst Softw 2(2): 265–276 (1983) · Zbl 05432349
[10] Chandy KM, Lamport L: Distributed snapshots: determining global states of distributed systems. ACM Trans Comput Syst 31(1): 63–75 (1985)
[11] Charron-Bost B: Combinatorics and geometry of consistent cuts: application to concurrency theory. In: Bermond JC, Raynal M (eds) Distributed algorithms. LNCS, vol 392. Springer, Berlin Heidelberg New York 1989, pp 45–56
[12] Charron-Bost B: Concerning the size of logical clocks in distributed systems. Inf Process Lett 39: 11–16 (1991) · Zbl 0735.68003
[13] Charron-Bost B, Delporte-Gallet C, Fauconnier H: Local and temporal predicates in distributed systems. Technical Report, LITP, IBP, Université Paris 7, Paris 1992
[14] Charron-Bost B, Mattern F, Tel G: Synchronous, asynchronous, and causally ordered communication. Distrib Comput (to appear)
[15] Cooper R, Marzullo K: Consistent detection of global predicates. Proc ACM/ONR Workshop on Parallel and Distributed Debugging, Santa Cruz, California 1991, pp 163–173
[16] Diehl C, Jard C: Interval approximations and message causality in distributed systems. In: Finkel A, Jantzen M (eds) Proc of the 9th Annual Symposium on Theoretical Aspects of Computer Science STACS ’92, LNCS, vol 577, Springer, Berlin Heidelberg New York 1992, pp 363–374
[17] Fidge CJ: Timestamps in message-passing systems that preserve the partial ordering. Proc 11th Australian Computer Science Conference, University of Queensland, pp 55–66, 1988
[18] Fidge CJ: Partial orders for parallel debugging. ACM SIGPLAN Notices 24(1): 183–194 (1989)
[19] Fidge CJ: Dynamic analysis of event orderings in message passing systems. PhD Thesis, Department of Computer Science, Australian National University, Canberra 1989
[20] Fidge CJ: Logical time in distributed computing systems. IEEE Comput 24(8): 28–33 (1991)
[21] Fischer MJ, Michael A: Sacrificing serializability to attain high availability of data in an unreliable network. Proc ACM SIGACT-SIGOPS Symposium on Principles of Database Systems, pp 70–75, 1982
[22] Fishburn PC: Interval orders and interval graphs. Wiley, New York 1985 · Zbl 0568.05047
[23] Floyd RW: Algorithm 97, shortest path. Commun ACM 5: 345 (1962)
[24] Fowler J, Zwaenepoel W: Causal distributed breakpoints. Proc 10th Int Conference on Distributed Computing Systems, Paris, pp 134–141, 1990
[25] Gait J: A probe effect in concurrent programs. Softw Pract Exper 16 (3): 225–233 (1986)
[26] Garg VK, Waldecker B: Detection of unstable predicates in distributed programs. In: Shyamasundra R (ed) Proc 12th Conference on the Foundation of Software Technology and Theoretical Computer Science, LNCS, vol 652. Springer, Berlin Heidelberg New York 1992, pp 253–264
[27] Haban D, Weigel W: Global events and global breakpoints in distributed systems. Proc 21st Annual Hawaii Int Conference on System Sciences, pp 166–175, 1988
[28] Haban D, Zhou S, Maurer D, Wilhelm R: Specification and detection of global breakpoints in distributed systems. Technical Report SFB124-08/1991, Universität des Saarlandes, Saarbrücken 1991
[29] Harel D, Pnueli A: On the development of reactive systems. In: Apt K (ed) Logics and models of concurrent systems, NATO ASI Series F, vol 13. Springer, Berlin Heidelberg New York 1985, pp 477–498 · Zbl 0581.68046
[30] Hseush W, Kaiser GE: Modeling concurrency in parallel debugging. ACM SIGPLAN Notices 25(3): 11–20 (1990)
[31] Hutto P, Ahamad M: Slow memory: weakening consistency to enhance concurrency in distributed shared memories. Proc 10th Int Conference on Distributed Computing Systems, Paris, pp 302–309, 1990
[32] Johnson DB, Zwaenepoel W: Recovery in distributed systems using optimistic message logging and checkpointing. J Algorithms 11(3): 462–491 (1990) · Zbl 0711.68009
[33] Katz S, Peled D: Interleaving set temporal logic. Theor Comput Sci 75: 263–287 (1990) · Zbl 0718.03014
[34] Katz S, Peled D: Verification of distributed programs using representative interleaving sequences. Distrib Comput 6: 107–120 (1992) · Zbl 0773.68053
[35] Kearns JP, Koodalattupuram B: Immediate ordered service in distributed systems. Proc 9th Int Conference on Distributed Computing Systems, Newport Beach, California, 611–618, 1989
[36] Lamport L: Time, clocks, and the ordering of events in a distributed system. Commun ACM 21 (7): 558–565 (1978) · Zbl 0378.68027
[37] LeBlanc TJ, Mellor-Crummey JM: Debugging parallel programs with instant replay. IEEE Trans Comput 36(4): 471–482 (1987)
[38] Leu E, Schiper A, Zramdini A: Efficient excution replay for distributed memory architectures. In: Bode A (ed) Proc 2nd European Distributed Memory Computing Conference, Munich, Germany, LNCS, vol 487. Springer, Berlin Heidelberg New York 1991, pp 315–324
[39] Manabe Y, Imase M: Global conditions in debugging distributed programs. J Parallel Distrib Comput 15: 62–69 (1992) · Zbl 0763.68027
[40] Manna Z, Pnueli A: The temporal logic of reactive and concurrent systems. Springer, Berlin Heidelberg New York 1992 · Zbl 0753.68003
[41] Marzullo K, Neiger G: Detection of global state predicates. In: Toueg S, Spirakis PG, Kirousis L (eds) Proc 5th Workshop on Distributed Algorithms (WDAG-91), Delphi, Greece, LNCS, vol 579. Springer, Berlin Heidelberg New York 1991, pp 254–272
[42] Masuzawa T, Tokura N: A causal distributed breakpoint algorithm for distributed debugger. Proc ICICE Fall Conf (SD-1-8) 6: 373–374 (1992)
[43] Mattern F: Algorithms for distributed termination detection. Distrib Comput 2: 161–175 (1987)
[44] Mattern F: Virtual time and global states in distributed systems. In: Cosnard M et al. (eds) Proc Workshop on Parallel and Distributed Algorithms, Chateau de Bonas, Oct. 1988, Elsevier, North Holland, 1989, pp 215–226
[45] Mattern F: Efficient algorithms for distributed snapshots and global virtual time approximation. J Parallel Distrib Comput 18: 423–434 (1993)
[46] Medina R: Incremental garbage collection of causal relationship computation in distributed systems. Proc 5th IEEE Symposium on Parallel and Distributed Processing, Irving, Texas, 1993
[47] Meldal S, Sankar S, Vera J: Exploiting locality in maintaining potential causality. Proc 10th Annual ACM Symposium on Principles of Distributed Computing, Montreal, Canada, pp 231–239, 1991 · Zbl 1314.68383
[48] Miller BP, Choi JD: Breakpoints and halting in distributed programs. Proc 8th Int Conference on Distributed Computing Systems, pp 316–323, 1988
[49] Minkowski H: Raum und Zeit. In: Lorentz HA, Einstein A, Minkowski H: Das Relativitätsprinzip. Eine Sammlung von Abhandlungen. Teubner, Leipzig 1915, pp 56–68
[50] Netzer RHB, Miller BP: Optimal tracing and replay for debugging message-passing parallel programs. Proc Supercomputing ’92, Minneapolis, pp 502–511, 1992
[51] Nielsen M, Plotkin G, Winskel G: Petri nets, event structures and domains, part I. Theor Comput Sci 13: 85–108 (1981) · Zbl 0452.68067
[52] Ochmanski E: Inevitability in concurrent systems. Inf Process Lett 25: 221–225 (1987) · Zbl 0653.68011
[53] Panangaden P, Taylor K: Concurrent common knowledge: defining agreement for asynchronous systems. Distrib Comput 6(2): 73–93 (1992) · Zbl 0773.68009
[54] Ponamgi MK, Hseush W, Kaiser GE: Debugging multithreaded programs with MPD. IEEE Software, 8(3): 37–43 (1991) · Zbl 05101574
[55] Pratt V: Modeling concurrency with partial orders. Int J Parallel Program 15(1): 33–71 (1986) · Zbl 0622.68034
[56] Pratt V: Modeling concurrency with geometry. Proc 18th Annual Symposium on Principles of Programming Languages (POPL-91), pp 311–322, 1991
[57] Pratt V: Arithmetic+Logic+Geometry=Concurrency. In: Simon I (ed) Proc LATIN ’92, LNCS, vol 583. Springer, Berlin Heidelberg New York 1992, pp 430–447
[58] Raynal M, Schiper A, Toueg S: The causal ordering abstraction and a simple way to implement it. Inf Process Lett 39: 343–350 (1991) · Zbl 0748.68026
[59] Reisig W: A strong part of concurreney. In: Rozenberg G (ed) Advances in Petri nets, LNCS, vol 266. Springer, Berlin Heidelberg New York 1987, pp 238–272
[60] Reisig W: Temporal logic and causality in concurrent systems. In: Vogt FH (ed) Proc Concurrency ’88, LNCS, vol 335. Springer, Berlin Heidelberg New York 1988, pp 121–139 · Zbl 0663.68039
[61] Reisig W: Parallel composition of liveness. Technical Report SFB342/30/91A, Technische Universität München, München 1991
[62] van Renesse R: Causal controversy at Le Mont St.-Michel. ACM Operating Syst Rev 27(2):44–53 (1993)
[63] Sandoz A, Schiper A: A characterization of consistent distributed snapshots using causal order. Technical Report 92-14, Département d’Informatique, Ecole Polytechnique Fédérale de Lausanne, Lausanne 1992
[64] Schiper A, Eggli J, Sandoz A: A new algorithm to implement causal ordering. In: Bermond JC, Raynal M (eds) Proc Workshop on Distributed Algorithms, Nice, France, LNCS, vol 392. Springer, Berlin Heidelberg New York 1989, pp 219–232
[65] Singhal M, Kshemkalyani A: An efficient implementation of vector clocks. Inf Process Lett 43: 47–52 (1992) · Zbl 0780.68050
[66] Spezialetti M, Kearns JP: Simultaneous regions: a framework for the consistent monitoring of distributed systems. Proc 9th Int Conference on Distributed Computing Systems, Newport Beach, California, pp 61–68, 1989
[67] Stone JM: Visualizing concurrent processes. Technical Report RC 12973, IBM TJ Watson Research Center 1987
[68] Stone JM: Debugging concurrent processes: a case study. Proc SIGPLAN Conf on Programming Language Design and Implementation, Atlanta, Georgia, pp 145–153, 1988
[69] Stone JM: A graphical representation of concurrent processes. ACM SIGPLAN Notices 24(1): 226–235 (1989)
[70] Strom R, Yemini S: Optimistic recovery in distributed systems. ACM Trans Comput Syst 3 (3): 204–226 (1985)
[71] Szpilrajn E: Sur l’extension de l’ordre partiel. Fund Math 16: 386–389 (1930) · JFM 56.0843.02
[72] Warshall S: A theorem on Boolean matrices. J ACM 9: 11–12 (1962) · Zbl 0118.33104
[73] Winskel G: An introduction to event structures. In: de Bakker JW, de Roever WP, Rozenberg G (eds) Proc Workshop on Linear Time, Branching Time and Partial Order in Logics and Models for Concurrency, Noordwijkerhout, The Netherlands, LNCS, vol 354. Springer, Berlin Heidelberg New York 1988, pp 364–397
[74] Wuu GTJ, Bernstein AJ: Efficient solutions to the replicated log and dictionary problems. Proc ACM Symposium on Principles of Distributed Computing, pp 233–242, 1984
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.