Efficient Checkpoint Mechanisms for Massively Parallel Machines

Abstract:


Massively parallel SIMD machines such as the DECmpp 12000 typically contain thousands of processor units and are therefore more likely to suffer system breakdowns caused by component failures. This work studies efficient diskless checkpointing mechanisms for massively parallel SIMD machines. Three checkpointing schemes (mirror checkpointing, parity checkpointing, and partial parity checkpointing) are compared in terms of checkpoint performance and storage overhead, based on measurements from real programs. The mirror and parity checkpointing schemes have been implemented and tested on a DECmpp 12000 machine without hardware or OS modifications. We show that mirror checkpointing is an order of magnitude faster than parity checkpointing but incurs twice the storage overhead. Partial parity checkpointing significantly reduces the storage overhead but can lead to unpredictable execution performance. This talk also examines the detailed storage/performance tradeoffs of partial parity checkpointing through manual instrumentation and describes the implementation experience gained from the experiments.
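
To make the difference between mirror and parity checkpointing concrete, the following Python sketch simulates both schemes in the general sense of diskless checkpointing. It is an illustration under stated assumptions, not the DECmpp implementation described in the talk: each processor unit's state is modeled as an equal-length byte string, the mirror is assumed to live on the next unit in a ring, parity is assumed to be a bytewise XOR over fixed-size groups, and all function names and the group size are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def mirror_checkpoint(states):
    """Assumed layout: unit i stores a full copy of unit (i - 1)'s state."""
    n = len(states)
    return [bytes(states[(i - 1) % n]) for i in range(n)]


def mirror_recover(mirrors, failed):
    """The copy of a failed unit is held by its ring successor."""
    return mirrors[(failed + 1) % len(mirrors)]


def parity_checkpoint(states, group_size):
    """Assumed layout: one XOR parity block per group of units."""
    return [
        reduce(xor_bytes, states[start:start + group_size])
        for start in range(0, len(states), group_size)
    ]


def parity_recover(states, parities, failed, group_size):
    """Rebuild one failed unit from its group's parity and the survivors."""
    group = failed // group_size
    start = group * group_size
    survivors = [
        state
        for index, state in enumerate(states[start:start + group_size], start)
        if index != failed
    ]
    return reduce(xor_bytes, survivors, parities[group])


# Usage: eight simulated units, each with three bytes of state.
states = [bytes([i, (i * 3) % 256, 255 - i]) for i in range(8)]

mirrors = mirror_checkpoint(states)
parities = parity_checkpoint(states, group_size=4)

# Simulate the loss of unit 6 and rebuild it from parity.
lost = list(states)
lost[6] = None
rebuilt = parity_recover(lost, parities, failed=6, group_size=4)
```

In this sketch the mirror scheme keeps one full copy per unit, while the parity scheme keeps one block per group and can rebuild only a single failed unit per group. The sketch does not model timing, so it says nothing about the measured performance gap reported above.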

