Quaglia, Francesco; Santoro, Andrea
Modeling and optimization of non-blocking checkpointing for optimistic simulation on myrinet clusters. (English) Zbl 1078.68820
J. Parallel Distrib. Comput. 65, No. 6, 667-677 (2005).

Summary: The Checkpointing-and-Communication Library (CCL) is a recently developed software library that implements CPU-offloaded, non-blocking checkpointing in support of optimistic parallel simulation on Myrinet clusters. It does so by exploiting the data transfer capabilities of the programmable DMA engine on board Myrinet network cards. CPU and DMA activities must sometimes be re-synchronized, for example to maintain data consistency, which adds overhead to non-blocking checkpoint operations that would otherwise cost the CPU nothing. In this paper we present a detailed cost model for non-blocking checkpointing and derive a performance-effective re-synchronization semantics, which we call minimum cost re-synchronization. Under this semantics, each occurrence of re-synchronization either commits the ongoing DMA-based checkpoint operation (suspending CPU activities until it completes) or aborts it (possibly increasing the expected rollback cost, since fewer checkpoints are committed). The choice is whichever option the cost model predicts to have the lower expected overhead.

MSC: 68U20 Simulation (MSC2010)

Keywords: Parallel discrete-event simulation; Checkpointing; Optimistic synchronization; Rollback-recovery; Myrinet; DMA; COTS; Performance optimization
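The commit-or-abort choice described in the summary can be illustrated with a minimal sketch. It is not the authors' cost model, which the summary does not reproduce. The names `ResyncCosts`, `remaining_dma_time`, `extra_rollback_cost`, and `resync_decision` are hypothetical, and the two overheads are assumed to be already estimated and expressed in the same time unit.

```python
from dataclasses import dataclass


@dataclass
class ResyncCosts:
    """Hypothetical inputs; the paper's actual cost model is not reproduced here."""
    # Assumed: time the CPU would stay suspended until the DMA transfer completes.
    remaining_dma_time: float
    # Assumed: expected increase in rollback cost if this checkpoint is discarded.
    extra_rollback_cost: float


def resync_decision(costs: ResyncCosts) -> str:
    """Return 'commit' or 'abort', whichever has the lower expected overhead."""
    commit_overhead = costs.remaining_dma_time   # CPU waits for the checkpoint
    abort_overhead = costs.extra_rollback_cost   # checkpoint is lost
    # Ties are resolved in favour of committing (an arbitrary choice for this sketch).
    return "commit" if commit_overhead <= abort_overhead else "abort"
```

For instance, under these assumptions `resync_decision(ResyncCosts(5.0, 20.0))` selects committing, because the short remaining wait costs less than the expected rollback penalty, while a long remaining transfer with a small rollback penalty would select aborting.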