Fault Tolerance Seminar (2014)
Prof. Dr. Andreas Polze
Daniel Richter
In distributed software systems, fault tolerance denotes the ability to continue operating in the
presence of component failures.
The importance of fault tolerance has grown steadily with the increasing complexity and distribution of
technical systems.
Various forms of redundancy in space and time can be employed to build highly available or reliable
software. In order to make a software system robust against many faults, a detailed understanding of its
distributed architecture, the component interfaces, and its probabilistic failure behaviour is necessary.
This seminar provides an overview of how fault tolerance, as a central dependability requirement, is achieved in
modern software systems.
Topics covered in the seminar include distributed systems theory, patterns for fault tolerance, as well as fault
tolerance case studies at the operating systems and distributed application levels.
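As a rough illustration of the redundancy in space and time mentioned above, the following minimal sketch estimates how replication and retries change the probability of success. It assumes that failures are independent and that every replica or attempt fails with the same probability; real systems often violate both assumptions. The function names `k_of_n_availability` and `retry_success` are hypothetical and chosen only for this example.

```python
from math import comb


def k_of_n_availability(n: int, k: int, p_fail: float) -> float:
    """Redundancy in space: probability that at least k of n replicas are up.

    Assumes independent replicas that each fail with probability p_fail.
    """
    p_up = 1.0 - p_fail
    return sum(comb(n, i) * p_up**i * p_fail ** (n - i) for i in range(k, n + 1))


def retry_success(attempts: int, p_fail: float) -> float:
    """Redundancy in time: probability that at least one attempt succeeds.

    Assumes independent attempts that each fail with probability p_fail.
    """
    return 1.0 - p_fail**attempts


if __name__ == "__main__":
    # A single component versus a 2-of-3 majority of replicas.
    print(k_of_n_availability(1, 1, 0.1))
    print(k_of_n_availability(3, 2, 0.1))
    # Up to three attempts at the same operation.
    print(retry_success(3, 0.1))
```

Under these assumptions, a 2-of-3 majority with a per-replica failure probability of 0.1 is available with probability about 0.972, compared with 0.9 for a single component.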
Introduction Slides
Organization
Extent: 2 semester hours (3 graded credit points)
Dates: Thursday, 13.30 - 15.00
Identifiers (SO2010): SAMT, OSIS
The seminar focusses on literature review. Participants are required to read fundamental scientific publications on fault tolerance in distributed systems, and present them to their fellow students.
- Each participant is expected to give a 30-45 minute presentation on a topic.
- Presentation slides should be discussed with a supervisor one week prior to the presentation date.
- At the end of the seminar, we plan to assemble a technical report about your seminar topics.
Topics
Topic Description
We are open to any topic suggestions. Each of the following proposed topics may be worked on by one or two students.
Distributed Systems Theory
- Consensus Algorithms: Paxos, EPaxos, Chandra-Toueg, Raft
- Logical Clocks: Lamport clocks, Vector Clocks, Matrix Clocks
- Impossibilities in Distributed Systems: FLP Impossibility, CAP Theorem, 100 Impossibility Proofs
- Self-Stabilizing Systems
Fault Tolerance Patterns + Strategies
- Acceptability Oriented Computing, Failure-Oblivious Computing
- Replication Schemes: Survey, Classification
- Coding Techniques for Fault Tolerance: Defensive Programming, Design by Contract, Proof-Carrying Code, Dependability Compiler
- Recovery-Oriented Computing: Crash-Only Software, Micro-Reboot, System-wide Undo
- Biologically Inspired Approaches: Treating Bugs as Allergies, Programming Model
Fault Tolerant Applications
Presentation Dates
According to your prioritized lists of topics, the seminar schedule is as follows:
Date | Presenter | Topic | Supervisor |
10.04.2014 | Introduction | ||
24.04.2014 | Topic Conflict Resolution | ||
01.05.2014 | Labour Day (Tag der Arbeit) | ||
08.05.2014 | (Buffer) | ||
15.05.2014 | Nils Kenneweg | Consensus Algorithms | LH |
15.05.2014 | Vasily Kirilichev | Impossibilities in Distributed Systems | LH |
22.05.2014 | Nicco Kunzmann | Self-Stabilizing Systems | DR |
22.05.2014 | Andreas Grapentin | Coding Techniques for Fault Tolerance | DR |
29.05.2014 | Ascension Day (Himmelfahrt) | ||
05.06.2014 | SAPPHIRE | ||
12.06.2014 | Paul Meinhardt | Replication Schemes | DR |
12.06.2014 | Vincent Schwarzer | Acceptability Oriented Computing | DR |
19.06.2014 | Malte Swart | Recovery Oriented Computing | LH |
19.06.2014 | Patrick Rein | Biologically Inspired Approaches | DR |
26.06.2014 | Symposium on Future Trends in Service-Oriented Computing 2014 | ||
03.07.2014 | (Buffer) | ||
10.07.2014 | (Buffer) | ||
!16.07.2014 | Johannes Henning | Erlang/OTP | LH |
!16.07.2014 | Jan-Peer Rudolph | Dynamo | LH |
17.07.2014 | Component Programming & Middleware | ||