Dependable Systems (2010)

Summer 2010

Dr. Peter Tröger

Please note: The oral exam takes place in A1.1.

Description

Continuous service provisioning is a key feature of modern hardware and software server systems. These systems achieve their level of user-perceived availability through a set of formal and technical approaches, commonly summarized under the term dependability.

Dependability is defined as the trustworthiness of hardware and software systems, so that reliance can be placed on the service they provide. The main dependability attributes commonly known and accepted are availability, reliability, safety, and security.
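
As a small illustration of how one of these attributes is commonly quantified, the sketch below computes steady-state availability from mean time to failure (MTTF) and mean time to repair (MTTR) using the standard ratio MTTF / (MTTF + MTTR). The function names and the sample numbers are assumptions chosen for illustration; they are not taken from the course material.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must be non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def yearly_downtime_minutes(availability: float) -> float:
    """Expected downtime per 365-day year, in minutes, for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical system: fails on average every 999 hours, repaired within 1 hour.
    a = steady_state_availability(999.0, 1.0)
    print(f"availability = {a:.4f}")
    print(f"downtime per year = {yearly_downtime_minutes(a):.1f} minutes")
```

With these assumed values the system reaches "three nines" of availability, which still corresponds to several hours of downtime per year.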

The Dependable Systems course gives an introduction to the theoretical foundations, common building blocks, and example implementations for dependable IT components and systems. The focus is on reliability and availability aspects of dependable systems, such as reliability analysis, fault tolerance, fault models, and failure prediction. Among others, the following topics are covered:

  • Dependability definitions and metrics
  • Design patterns for fault tolerance
  • Analytical evaluation of system dependability
  • Hardware dependability approaches
  • Software dependability approaches
  • Latest research topics

This course is an extended and adjusted version of the 'Dependable Systems' course by Prof. M. Malek and Dr. F. Salfner at Humboldt University, Computer Architecture and Communication Group.

Regulations

Students taking this course need basic knowledge of operating systems and middleware technology. At the request of at least one participant, the course will be given in English. The course consists of two modules: lectures and group projects. Successful completion of the project work requires practical experiments in one of the given topics. The results of these experiments must be described in a written report. A passing grade on the report is a mandatory precondition for taking the oral exam. The final course grade is the oral exam grade.

Dates

  • Lectures: Tue, 13:00 - 14:30 / Wed, 13:00 - 14:30
  • Project Decision / Final Course Enrollment: May 14
  • Project Presentation: July 13-14 (see below)
  • Project Report Submission: July 31 (5-25 pages)

Oral Exam Date              Student
3.8.2010, 09:00 - 09:30     Edgar Naether
3.8.2010, 09:30 - 10:00     Sven Wagner-Boysen
3.8.2010, 10:00 - 10:30     Benjamin Karran
3.8.2010, 10:30 - 11:00     Marko Röder
3.8.2010, 11:00 - 11:30     Christopher Schuster
3.8.2010, 11:30 - 12:00     Richard Metzler
3.8.2010, 13:30 - 14:00     Matthias Richly
3.8.2010, 14:30 - 15:00
30.9.2010, 09:00 - 09:30    Paul Römer
30.9.2010, 09:30 - 10:00    Martin Schütte
30.9.2010, 10:00 - 10:30    Norman Kluge
30.9.2010, 10:30 - 11:00    Ingo Jaeckel
30.9.2010, 11:00 - 11:30    Jan Schütze
30.9.2010, 11:30 - 12:00    Frank Zschockelt
30.9.2010, 13:30 - 14:00    Jan Brunnert
30.9.2010, 14:30 - 15:00

Project Work

The knowledge gained from the lectures has to be applied in practical project work. Students need to form groups of 2-3 persons and work jointly on a dependability experiment from one of the topics given below. Each topic is supervised by a member of the Operating Systems and Middleware Group. The results have to be presented at the end of the semester and documented in a report. Every group needs to answer the following questions in its oral / written result presentation:

  • How does the product / solution compare to similar solutions?
  • What are the installation / operational experiences?
  • What is the supported fault model on the different hardware / software layers in the investigated solution?
  • What error states are supported / reported?
  • What is the chosen redundancy approach? In case of data replication, what is the functional extent and the consistency model?
  • Is there any specified downtime during error recovery or compensation?
  • What is the performance impact of the chosen fault tolerance technique?

Project Topic List

Clustering of OpenVMS installations for high availability (Norman Kluge)

  • Supervisor: Bernhard Rabe
  • Presentation: July 13th, 13:00

Comparing clustering solutions for J2EE application servers (Edgar Näther, Sven Wagner-Boysen)

  • Products: JBoss and GlassFish application servers
  • Analysis of FT-clustering capabilities for web tier and business tier
  • Documentation of installation experience
  • How well is the clustered deployment of applications handled?
  • Performance comparison of clustered solutions (2-tier J2EE application, no data tier)
  • Supervisors: Frank Feinbube, Robert Wierschke
  • Presentation: July 13th, 13:30

Virtualization Fault Tolerance - Feature analysis (Matthias Richly, Christopher Schuster)

  • Analyze available solutions for fault-tolerant operation of virtual machines
  • In-depth investigation of one particular product, preferably VMware
  • Analysis of failover capabilities (e.g. open network connections)
  • Supervisor: Bernhard Rabe
  • Presentation: July 13th, 14:00

Dependability analysis of railroad design alternatives (free)

  • Supervisors: Uwe Hentschel, Jan-Arne Sobarnia

Software-implemented fault injection in Windows (Benjamin Karran)

  • Implementation of an operating system-based fault injector as driver
  • Interception of system calls for chosen processes and threads (e.g. disk full notification)
  • Comparative test with different applications regarding their error handling capabilities
  • Supervisor: Peter Tröger
  • Presentation: July 13th, 14:30

Software-implemented fault injection in Linux (Frank Zschockelt, Paul Römer)

  • Implementation of an operating system-based fault injector as driver
  • Injection of single bit stuck-at and bit-flip faults at chosen locations
  • Comparative test with different fault locations under the fail-stop fault model
  • Presentation: July 14th, 13:00
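
The two fault types named in this topic can be illustrated at the level of a single integer value. The sketch below is only a user-space illustration of what a bit-flip and a stuck-at fault do to a stored word. It is not a kernel driver and not the project's implementation, and the function names are hypothetical.

```python
def flip_bit(value: int, bit: int) -> int:
    """Transient bit-flip fault: invert exactly one bit of the value."""
    return value ^ (1 << bit)


def stuck_at(value: int, bit: int, level: int) -> int:
    """Permanent stuck-at fault: force one bit to 0 or 1, whatever was written."""
    if level not in (0, 1):
        raise ValueError("level must be 0 or 1")
    mask = 1 << bit
    return (value | mask) if level == 1 else (value & ~mask)


if __name__ == "__main__":
    word = 0b1010
    print(bin(flip_bit(word, 0)))     # bit 0 inverted
    print(bin(stuck_at(word, 1, 0)))  # bit 1 forced to 0
    print(bin(stuck_at(word, 1, 1)))  # bit 1 was already 1, value unchanged
```

A bit-flip always changes the value, whereas a stuck-at fault only becomes visible when the affected bit is supposed to hold the opposite level. This is one reason why the chosen fault location and the data present there influence whether an injected fault leads to an observable error.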

FT CORBA (free)

  • Supervisor: Martin v. Löwis

Windows Cluster Services for High Availability (free)

  • Supervisor: Alexander Schmidt

Linux HA Cluster - Available solutions and their properties (Jan Brunnert, Ingo Jaeckel)

  • Supervisor: Peter Tröger
  • Presentation: July 14th, 13:30

Non-relational databases (Richard Metzler, Jan Schütze)

  • Comparison of different NoSQL database products (e.g. Cassandra, CouchDB, RIAK, Gizzard)
  • Deeper comparison of chosen subset with respect to: fault model, replication approach, data lookup handling, consistency model
  • Supervisor: Peter Tröger
  • Presentation: July 14th, 14:00

Distributed Fault-Tolerant File Systems (Martin Schütte, Marko Roeder)

  • Comparison of different distributed file systems with respect to the questions above
  • Examples: Coda, Microsoft DFS, dCache, IBM GPFS, Hadoop HDFS
  • Presentation: July 14th, 14:30

Recommended Readings

General Information

  • J. C. Laprie, A. Avizienis, and H. Kopetz, Eds., Dependability: Basic Concepts and Terminology. Secaucus, NJ, USA: Springer-Verlag, 1992.
  • D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems, 3rd ed. Wellesley, MA: A. K. Peters, Ltd., 1998.
  • W. R. Dunn, "Designing Safety-Critical Computer Systems," Computer, vol. 36, pp. 40–46, 2003.
  • G. F. Pfister, "High Availability," in In Search of Clusters, pp. 379–452.
  • D. K. Pradhan, Ed., Fault-tolerant computer system design. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1996.
  • R. Hanmer, Patterns for Fault Tolerant Software, 1st ed. John Wiley & Sons, 2007.
  • R. G. Johnston, "Being Vulnerable to the Threat of Confusing Threats with Vulnerabilities," Journal of Physical Security, vol. 4, no. 2, pp. 30–34, 2010.
  • J. Boner, "Scalability, Availability & Stability Patterns," 11-May-2010. [Online]. Available: http://www.slideshare.net/jboner/scalability-availability-stability-patterns.
  • O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, and M. Violante, Software-Implemented Hardware Fault Tolerance, 1st ed. Springer, 2010.
  • A. Avižienis, "Design of fault-tolerant computers," in Fall joint computer conference (AFIPS), 1967, pp. 733–743.

Case Studies

  • B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," in SIGMETRICS ’09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, 2009, pp. 193–204.
  • Y. C. Yeh, "Triple-Triple Redundant 777 Primary Flight Computer," in IEEE Aerospace Applications Conference, 1996, pp. 293–307.
  • T. Durkin, "What the Media Couldn’t Tell You About Mars Pathfinder," Robot Science & Technology, no. 1, 1998.
  • L. A. Keith, "Advisory Circular - System Design and Analysis." U.S. Department of Transportation, 21-Jun-1988.
  • G. Ramohalli, "The Honeywell on-board diagnostic and maintenance system for the Boeing 777," in IEEE/AIAA 11th Digital Avionics Systems Conference, 1992, pp. 485–490.
  • Y. C. Bob Yeh, "Design Considerations in Boeing 777 Fly-By-Wire Computers," in IEEE International Symposium on High-Assurance Systems Engineering, 1998.
  • R. Hess, "Computing platform architectures for robust operation in the presence of lightning and other electromagnetic threats," in 16th Digital Avionics Systems Conference (DASC), 1997.
  • D. Bernick, B. Bruckert, P. Del Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen, "NonStop Advanced Architecture," in International Conference on Dependable Systems and Networks (DSN), 2005, pp. 12–21.
  • W. Bartlett and L. Spainhower, "Commercial Fault Tolerance: A Tale of Two Systems," IEEE Trans. Dependable Secur. Comput., vol. 1, no. 1, pp. 87–96, 2004.
  • B. Schroeder and G. A. Gibson, "Understanding failures in petascale computers," Journal of Physics: Conference Series, vol. 78, no. 1, 2007.
  • E. B. Nightingale, J. R. Douceur, and V. Orgovan, "Cycles, cells and platters: An empirical analysis of hardware failures on a million consumer PCs," in Sixth conference on Computer systems (EuroSys), 2011, pp. 343–356.
  • S. Burger, O. Hummel, and M. Heinisch, "Airbus Cabin Software," IEEE Software, vol. 30, no. 1, pp. 21–25, 2013.
  • M. W. Winter, Software Fault Tree Analysis of an Automated Control System Device Written in Ada. Naval Postgraduate School, 1995.

Analytical Evaluation

  • D. A. Menasce and V. A.F. Almeida, Capacity Planning for Web Services: Metrics, Models, and Methods. Prentice Hall, 2002.
  • A. Sathaye, S. Ramani, and K. S. Trivedi, "Availability Models in Practice."
  • J. F. Meyer, "On Evaluating the Performability of Degradable Computing Systems," IEEE Trans. Comput., vol. 29, no. 8, pp. 720–731, Aug. 1980.
  • "Fehlermöglichkeits- und Einflussanalyse (FMEA) nach QS-9000."
  • W. E. Vesely, F. F. Goldberg, N. H. Roberts, and D. F. Haasl, Fault Tree Handbook. Washington, D.C.: US Nuclear Regulatory Commission, 1981.
  • D. W. Vesely, D. J. Dugan, J. Fragole, J. Minarik II, and J. Railsback, "Fault tree handbook with aerospace applications," NASA Office of Safety and Mission Assurance, NASA Headquarters, Washington, DC 20546, 2002.
  • NASA Scientific Program Information, "Fault Tree Analysis," National Aeronautics and Space Administration, Jul. 2000.
  • M. Malhotra and K. S. Trivedi, "Dependability modeling using Petri-nets," Reliability, IEEE Transactions on, vol. 44, pp. 428–440, Sep. 1995.
  • J. Bechta Dugan, S. J. Bavuso, and M. A. Boyd, "Fault trees and sequence dependencies," in Reliability and Maintainability Symposium (RAMS), pp. 286–293.
  • R. La Band and J.D. Andrews, "Phased mission modelling using fault tree analysis," in Proceedings of the Institution of Mechanical Engineers, 2004.
  • N. Limnios, Fault Trees, vol. 1. Great Britain: ISTE, 2007.
  • M. Malhotra and K. S. Trivedi, "Power-hierarchy of dependability-model types," IEEE Transactions on Reliability, vol. 43, no. 3, pp. 493–502, Sep. 1994.

Failure Prediction

  • F. Salfner, P. Tröger, and S. Tschirpke, "Cross-Core Event Monitoring For Processor Failure Prediction," in The International Conference on High Performance Computing & Simulation, Workshop on Dependable Multi-Core Computing (DMCC), 2009, pp. 67–73.
  • J. Carreira, H. Madeira, and J. G. Silva, "Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers," IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 125–136, 1998.
  • H. Ziade, R. A. Ayoubi, and R. Velazco, "A Survey on Fault Injection Techniques," Int. Arab J. Inf. Technol., vol. 1, no. 2, pp. 171–186, 2004.
  • M. R. Lyu, Ed., Handbook of Software Reliability Engineering. McGraw-Hill, 1996.
  • IEEE, "IEEE Std 610.12-1990." IEEE Standards Board, New York, USA, 28-Sep-1990.
  • K. Vaidyanathan and K. S. Trivedi, "A comprehensive model for software rejuvenation," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 124–137, Jun. 2005.
  • L. Sha, "Using Simplicity to Control Complexity," IEEE Software, vol. 18, no. 4, pp. 20–28, 2001.
  • A. Avižienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Transactions on Software Engineering, vol. SE-11, no. 12, pp. 1491–1501, Dec. 1985.
  • G. Candea, J. Cutler, and A. Fox, "Improving Availability with Recursive Microreboots: A Soft-State System Case Study," Performance Evaluation Journal, vol. 56, no. 1–3, Mar. 2004.
  • A. Wood, "Software Reliability Growth Models," Tandem Computers, Cupertino, CA, Sep. 1996.
  • J. D. Musa and K. Okumoto, "Application of Basic and Logarithmic Poisson Execution Time Models in Software Reliability Measurement," in Software Reliability Modelling and Identification, 1988, pp. 68–100.
  • A. Benso, S. Chiusano, P. Prinetto, and L. Tagliaferri, "A C/C++ source-to-source compiler for dependable applications," in Dependable Systems and Networks (DSN), 2000.
  • T. J. Shimeall and N. G. Leveson, "An empirical comparison of software fault tolerance and fault elimination," in Second Workshop on Software Testing, Verification, and Analysis, 1988.
  • D. Hoffman, "A Taxonomy for Test Oracles." Quality Week 1998, 30-Mar-1998.
  • J. D. Musa, "Software reliability-engineered testing," Computer, vol. 29, no. 11, Nov. 1996.

Distributed Systems

  • L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558–565, 1978.
  • J. Gray, "Why do computers stop and what can be done about it?," in Symposium on Reliability in Distributed Software and Database Systems (SRDS-5), 1986, pp. 3–12.
  • P. Jalote, Fault Tolerance in Distributed Systems. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1994.
  • L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," Transactions on Programming Languages and Systems (TOPLAS), vol. 3, pp. 382–401, 1982.
  • M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of distributed consensus with one faulty process," J. ACM, vol. 32, no. 2, pp. 374–382, 1985.
  • J.-C. Laprie and K. Kanoun, "Software Reliability and System Reliability," in Handbook of software reliability engineering, M. R. Lyu, Ed. McGraw-Hill, 1996, pp. 27–69.
  • J. P. Hansen and D. P. Siewiorek, "Models for time coalescence in event logs," in International Symposium on Fault-Tolerant Computing (FTCS-22), 1992, pp. 221–227.
  • P. Lorczak, A. Caglayan, and D. Eckhardt, "A theoretical investigation of generalised voters," in IEEE 19th Annual International Symposium on Fault-Tolerant Computing Systems (FTCS’19), pp. 444–451.
  • K. M. Chandy and L. Lamport, "Distributed snapshots: determining global states of distributed systems," ACM Trans. Comput. Syst., vol. 3, no. 1, pp. 63–75, 1985.
  • G. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems - Concepts and Design, 4th ed. Addison Wesley, 2005.
  • A. S. Tanenbaum and M. van Steen, Distributed Systems - Principles and Paradigms. Prentice-Hall Inc., 2002.
  • G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon’s highly available key-value store," SIGOPS Oper. Syst. Rev., vol. 41, no. 6, pp. 205–220, Oct. 2007.
  • S. C. Kendall, J. Waldo, A. Wollrath, and G. Wyant, "A Note on Distributed Computing," Sun Microsystems, Inc., Mountain View, CA, USA, 1994.
  • F. Cristian, "Understanding Fault-Tolerant Distributed Systems," Communications of the ACM, vol. 34, no. 2, pp. 56–78, 1991.