A Method and Tool for Analyzing Fault-Tolerance in Systems

Abstract:

As computers are integrated into systems that have stringent fault-tolerance requirements, there is a growing need for techniques to establish that these systems actually satisfy those requirements. Informal arguments do not supply the desired level of assurance for critical systems. This dissertation presents a rigorous, automated approach to analyzing distributed systems, with a focus on checking fault-tolerance requirements, and describes a prototype implementation of the analysis. The analysis is a novel hybrid of ideas from stream-processing semantics of networks of processes, abstract interpretation of programs, and symbolic computation. The underlying principles of the analysis method are general, but specialized techniques---such as the use of perturbations to represent changes to normal behavior caused by failures---are developed to deal efficiently with the types of systems and requirements that arise in establishing fault-tolerance. The method is illustrated with three examples: the Oral Messages algorithm for Byzantine Agreement, due to Lamport, Shostak and Pease, a standard protocol for FIFO reliable broadcast, and a (subtly) flawed protocol for fault-tolerant moving agents.
Scott Stoller's Home Page