Dependable Systems (2010)

Summer 2010

Please note: Oral exam takes place in A1.1.

Description

Continuous service provisioning is a key feature of modern hardware and software server systems. These systems achieve their level of user-perceived availability through a set of formal and technical approaches, commonly summarized under the term dependability.

Dependability is defined as the trustworthiness of hardware and software systems, so that reliance can be placed on the service they provide. The main dependability attributes commonly known and accepted are availability, reliability, safety, and security.

The Dependable Systems course gives an introduction to the theoretical foundations, common building blocks, and example implementations of dependable IT components and systems. The focus is on the reliability and availability aspects of dependable systems, such as reliability analysis, fault tolerance, fault models, and failure prediction. Among other things, the following topics are covered:

Dependability definitions and metrics
Design patterns for fault tolerance
Analytical evaluation of system dependability
Hardware dependability approaches
Software dependability approaches
Latest research topics

This course is an extended and adjusted version of the 'Dependable Systems' course by Prof. M. Malek and Dr. F. Salfner at Humboldt University, Computer Architecture and Communication Group.

Regulations

Students taking this course need basic knowledge of operating systems and middleware technology. If at least one participant requests it, the course will be given in English. The course consists of two modules: lectures and group projects. Successful completion of the project work requires practical experiments in one of the given topics. The results of these experiments must be described in a written report. A passing grade on the report is a mandatory precondition for taking the oral exam. The final course grade is the oral exam grade.

Slides

Introduction
Definitions and Metrics
Dependability Impairments
Dependability Attributes
Dependability Means
Fault Tolerance Patterns (I)
Fault Tolerance Patterns (II)
Quantitative Evaluation (I)
Quantitative Evaluation (II)
Quantitative Evaluation (III)
Qualitative Evaluation
Hardware Testing
Hardware Diagnosis
Hardware Redundancy
Software Dependability
Trends in Software Dependability (Additional slides by Sun and Dave Patterson)
Distributed Systems
Case Studies
Proactive Fault Management (guest lecture by Dr. Felix Salfner)

Project Reports

Dates

Lectures: Tue, 13:00 - 14:30 / Wed, 13:00 - 14:30
Project Decision / Final Course Enrollment: May 14
Project Presentation: July 13-14 (see below)
Project Report Submission: July 31 (5-25 pages)

Oral Exam Date	Student
3.8.2010, 09:00 - 09:30	Edgar Naether
3.8.2010, 09:30 - 10:00	Sven Wagner-Boysen
3.8.2010, 10:00 - 10:30	Benjamin Karran
3.8.2010, 10:30 - 11:00	Marko Röder
3.8.2010, 11:00 - 11:30	Christopher Schuster
3.8.2010, 11:30 - 12:00	Richard Metzler
3.8.2010, 13:30 - 14:00	Matthias Richly
3.8.2010, 14:30 - 15:00
30.9.2010, 09:00 - 09:30	Paul Römer
30.9.2010, 09:30 - 10:00	Martin Schütte
30.9.2010, 10:00 - 10:30	Norman Kluge
30.9.2010, 10:30 - 11:00	Ingo Jaeckel
30.9.2010, 11:00 - 11:30	Jan Schütze
30.9.2010, 11:30 - 12:00	Frank Zschockelt
30.9.2010, 13:30 - 14:00	Jan Brunnert
30.9.2010, 14:30 - 15:00

Project Work

The knowledge gained from the lectures has to be applied in practical project work. Students need to form groups of 2-3 people and work jointly on a dependability experiment from one of the topics given below. Each topic is supervised by a member of the Operating Systems and Middleware Group. The results have to be presented at the end of the semester and documented in a report. Every group needs to answer the following questions in its oral and written presentation of results:

How does the product / solution compare to similar solutions?
What are the installation / operational experiences?
What is the supported fault model on the different hardware / software layers in the investigated solution?
What error states are supported / reported?
What is the chosen redundancy approach? In case of data replication, what is the functional extent and the consistency model?
Is there any specified downtime during error recovery or compensation?
What is the performance impact of the chosen fault tolerance technique?

Project Topic List

Clustering of OpenVMS installations for high availability (Norman Kluge)

Supervisor: Bernhard Rabe
Presentation: July 13th, 13:00

Comparing clustering solutions for J2EE application servers (Edgar Näther, Sven Wagner-Boysen)

Products: JBoss and GlassFish application servers
Analysis of FT-clustering capabilities for web tier and business tier
Documentation of installation experience
How well is the clustered deployment of applications solved?
Performance comparison of clustered solutions (2-tier J2EE application, no data tier)
Supervisors: Frank Feinbube, Robert Wierschke
Presentation: July 13th, 13:30

Virtualization Fault Tolerance - Feature analysis (Matthias Richly, Christopher Schuster)

Analyze available solutions for fault-tolerant operation of virtual machines
In-depth investigation of one particular product, preferably VMware
Analysis of failover capabilities (e.g. open network connections)
Supervisor: Bernhard Rabe
Presentation: July 13th, 14:00

Dependability analysis of railroad design alternatives (free)

Supervisors: Uwe Hentschel, Jan-Arne Sobarnia

Software-implemented fault injection in Windows (Benjamin Karran)

Implementation of an operating system-based fault injector as driver
Interception of system calls for chosen processes and threads (e.g. disk full notification)
Comparative test with different applications regarding their error handling capabilities
Supervisor: Peter Tröger
Presentation: July 13th, 14:30

Software-implemented fault injection in Linux (Frank Zschockelt, Paul Römer)

Implementation of an operating system-based fault injector as driver
Injection of single bit stuck-at and bit-flip faults at chosen locations
Comparative test with different fault locations under the fail-stop fault model
Presentation: July 14th, 13:00
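
For orientation, the two fault types named in this topic can be illustrated on a plain integer value. This is only a minimal user-space sketch in Python, not the kernel driver the project asks for; the function names flip_bit and stuck_at are hypothetical and chosen for illustration.

```python
def flip_bit(value: int, bit: int) -> int:
    """Bit-flip fault: invert exactly one bit of the value."""
    return value ^ (1 << bit)


def stuck_at(value: int, bit: int, level: int) -> int:
    """Stuck-at fault: force one bit to a fixed level (0 or 1),
    regardless of what the value originally contained."""
    if level not in (0, 1):
        raise ValueError("level must be 0 or 1")
    mask = 1 << bit
    return (value | mask) if level else (value & ~mask)


if __name__ == "__main__":
    word = 0b1010
    print(bin(flip_bit(word, 0)))     # bit 0 inverted
    print(bin(stuck_at(word, 1, 0)))  # bit 1 forced to 0
    print(bin(stuck_at(word, 1, 1)))  # bit 1 forced to 1 (already set)
```

A bit-flip always changes the value, whereas a stuck-at fault is only observable when the affected bit would otherwise hold the opposite level.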

FT CORBA (free)

Supervisor: Martin v. Löwis

Windows Cluster Services for High Availability (free)

Supervisor: Alexander Schmidt

Linux HA Cluster - Available solutions and their properties (Jan Brunnert, Ingo Jaeckel)

Supervisor: Peter Tröger
Presentation: July 14th, 13:30

Non-relational databases (Richard Metzler, Jan Schütze)

Comparison of different NoSQL database products (e.g. Cassandra, CouchDB, Riak, Gizzard)
Deeper comparison of chosen subset with respect to: fault model, replication approach, data lookup handling, consistency model
Supervisor: Peter Tröger
Presentation: July 14th, 14:00

Distributed Fault-Tolerant File Systems (Martin Schütte, Marko Roeder)

Comparison of different distributed file systems with respect to the questions above
Examples: Coda, Microsoft DFS, dCache, IBM GPFS, Hadoop HDFS
Presentation: July 14th, 14:30