System-level resource monitoring in high-performance computing environments. (English) Zbl 1076.68513

J. Grid Comput. 1, No. 3, 273-289 (2003).
Summary: Low-overhead resource monitoring is key to the successful management of distributed high-performance computing environments, particularly when applications have well-defined quality of service requirements. The dproc system-level monitoring mechanisms provide tools both for efficiently monitoring system-level events and for notifying remote hosts of events relevant to their operation. Implemented as extension to the Linux kernel, dproc provides several key functions. First, utilizing the familiar /proc virtual filesystem, dproc extends this interface with resource information collected from both local and remote hosts. Second, to predictably capture and distribute monitoring information, dproc uses a kernel-level group communication facility, termed KECho, which implements events and event channels. Third, and the focus of this paper, is dproc’s run-time customizability for resource monitoring, which includes the generation and deployment of monitoring functionality within remote operating system kernels. Using dproc, we show that (a) data streams can be customized according to a client’s resource availabilities (dynamic stream management), (b) by dynamically varying distributed monitoring (dynamic filtering of monitoring information), an appropriate balance can be maintained between monitoring overheads and application quality, and (c) by performing monitoring at kernel-level, the information captured enables decision making that takes into account the multiple resources used by applications.


68M14 Distributed systems
68M20 Performance evaluation, queueing, and scheduling in the context of computer systems


Full Text: DOI