Fault Tolerance Seminar (2014)

Prof. Dr. Andreas Polze

Lena Herscheid
Daniel Richter

In distributed software systems, fault tolerance denotes the ability to continue operating in the presence of component failures. The importance of fault tolerance has grown steadily with the increasing complexity and distribution of technical systems. Various forms of redundancy in space and time can be employed to build highly available or reliable software. In order to make a software system robust against many faults, a detailed understanding of its distributed architecture, the component interfaces, and its probabilistic failure behaviour is necessary.

This seminar provides an overview of how fault tolerance, as a central dependability requirement, is achieved in modern software systems.
Topics covered in the seminar include distributed systems theory, patterns for fault tolerance, and fault tolerance case studies at the operating system and distributed application levels.

Introduction Slides

Organization

Extent: 2 semester hours (3 graded credit points)

Dates: Thursday, 13:30 - 15:00

Identifiers (SO2010): SAMT, OSIS

The seminar focusses on literature review. Participants are required to read fundamental scientific publications on fault tolerance in distributed systems and present them to their fellow students.
  • Each participant is expected to give a 30-45 minute presentation on a topic.
  • Presentation slides should be discussed with a supervisor one week prior to the presentation date.
  • At the end of the seminar, we plan to assemble a technical report about your seminar topics.

Topics

Topic Description

We are open to topic suggestions of your own. Each of the following proposed topics may be worked on by one or two students.

Distributed Systems Theory

Fault Tolerance Patterns + Strategies

Fault Tolerant Applications

Presentation Dates

Based on your prioritized lists of topics, the seminar schedule is as follows:

Date         Presenter           Topic                                    Supervisor
10.04.2014   Introduction
24.04.2014   Topic Conflict Resolution
01.05.2014   Labour Day (Tag der Arbeit)
08.05.2014   (Buffer)
15.05.2014   Nils Kenneweg       Consensus Algorithms                     LH
             Vasily Kirilichev   Impossibilities in Distributed Systems   LH
22.05.2014   Nicco Kunzmann      Self-Stabilizing Systems                 DR
             Andreas Grapentin   Coding Techniques for Fault Tolerance    DR
29.05.2014   Ascension Day (Himmelfahrt)
05.06.2014   SAPPHIRE
12.06.2014   Paul Meinhardt      Replication Schemes                      DR
             Vincent Schwarzer   Acceptability Oriented Computing         DR
19.06.2014   Malte Swart         Recovery Oriented Computing              LH
             Patrick Rein        Biologically Inspired Approaches         DR
26.06.2014   Symposium on Future Trends in Service-Oriented Computing 2014
03.07.2014   (Buffer)
10.07.2014   (Buffer)
!16.07.2014  Johannes Henning    Erlang/OTP                               LH
             Jan-Peer Rudolph    Dynamo                                   LH
17.07.2014   Component Programming & Middleware

Literature

(more will be added)

Fundamentals