Distributed Fault-Tolerance — Lessons from Delta-4

D. Powell


Abstract

Software-implemented approaches to fault tolerance adapt readily to change: because they do not depend on specialized hardware, advances in hardware technology do not force extensive redesign. This paper argues the case for implementing fault tolerance in a distributed fashion and reports the approach adopted in the European Delta-4 project. Fault tolerance is achieved by replicating capsules (the run-time representation of application objects) on distributed nodes interconnected by a local area network. Capsule groups can be configured to tolerate either stopping failures or arbitrary failures. Multipoint protocols are used to coordinate capsule groups and to support error processing and fault treatment. The paper concludes with a critical analysis of the project’s results.