Dependable Systems (2011)

Summer 2011

Dr. Peter Tröger

Oral exams: 22.8.-24.8., 19.9.-21.9.

Description

Continuous service provisioning is a key feature of modern hardware and software server systems. These systems achieve their level of user-perceived availability through a set of formal and technical approaches, commonly summarized under the term dependability.

Dependability is defined as the trustworthiness of hardware and software systems, so that reliance can be placed on the service they provide. The main dependability attributes commonly known and accepted are availability, reliability, safety, and security.

The Dependable Systems course introduces the theoretical foundations, common building blocks, and example implementations of dependable IT components and systems. The focus is on the reliability and availability aspects of dependable systems, such as reliability analysis, fault tolerance, fault models, and failure prediction. Among other things, the following topics are covered:

Dependability definitions and metrics
Design patterns for fault tolerance
Analytical evaluation of system dependability
Hardware dependability approaches
Software dependability approaches
Latest research topics

Regulations

Students taking this course need basic knowledge of operating systems and middleware technology. If at least one participant requests it, the course will be given in English. The course consists of two modules: lectures and assignments. A passing grade in 2 out of 3 assignments is the mandatory precondition for taking the oral exam. The final course grade is the oral exam grade.

Dates

Lectures: Tue, 13:30 - 15:00 (HS3) / Wed, 13:30 - 15:00 (HS1)

Slides

Assignments

The solution submission system requires authentication with either a valid DFN or HPI certificate. The latter can be generated automatically at the HPI Certificate Authority (CA). You need to use your HPI account credentials to enter the page. The generated user certificate is automatically installed in your browser - make sure that you export and re-import it if you want to access the submission system from another browser.

Recommended Readings

General Information

J. C. Laprie, A. Avizienis, and H. Kopetz, Eds., Dependability: Basic Concepts and Terminology. Secaucus, NJ, USA: Springer-Verlag, 1992.
D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems, 3rd ed. Wellesley, MA: A. K. Peters, Ltd., 1998.
W. R. Dunn, "Designing Safety-Critical Computer Systems," Computer, vol. 36, pp. 40–46, 2003.
G. F. Pfister, "High Availability," in In Search of Clusters, pp. 379–452.
D. K. Pradhan, Ed., Fault-tolerant computer system design. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1996.
R. Hanmer, Patterns for Fault Tolerant Software, 1st ed. John Wiley & Sons, 2007.
R. G. Johnston, "Being Vulnerable to the Threat of Confusing Threats with Vulnerabilities," Journal of Physical Security, vol. 4, no. 2, pp. 30–34, 2010.
J. Boner, "Scalability, Availability & Stability Patterns," 11-May-2010. [Online]. Available: http://www.slideshare.net/jboner/scalability-availability-stability-patterns.
O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, and M. Violante, Software-Implemented Hardware Fault Tolerance, 1st ed. Springer, 2010.
A. Avižienis, "Design of fault-tolerant computers," in Fall joint computer conference (AFIPS), 1967, pp. 733–743.

Case Studies

B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," in SIGMETRICS ’09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, 2009, pp. 193–204.
Y. C. Yeh, "Triple-Triple Redundant 777 Primary Flight Computer," in IEEE Aerospace Applications Conference, 1996, pp. 293–307.
T. Durkin, "What the Media Couldn’t Tell You About Mars Pathfinder," Robot Science & Technology, no. 1, 1998.
L. A. Keith, "Advisory Circular - System Design and Analysis." U.S. Department of Transportation, 21-Jun-1988.
G. Ramohalli, "The Honeywell on-board diagnostic and maintenance system for the Boeing 777," in IEEE/AIAA 11th Digital Avionics Systems Conference, 1992, pp. 485–490.
Y. C. Bob Yeh, "Design Considerations in Boeing 777 Fly-By-Wire Computers," High-Assurance Systems Engineering, IEEE International Symposium on, vol. 0, 1998.
R. Hess, "Computing platform architectures for robust operation in the presence of lightning and other electromagnetic threats," in 16th Digital Avionics Systems Conference (DASC), 1997.
D. Bernick, B. Bruckert, P. Del Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen, "NonStop Advanced Architecture," in International Conference on Dependable Systems and Networks (DSN), 2005, vol. 0, pp. 12–21.
W. Bartlett and L. Spainhower, "Commercial Fault Tolerance: A Tale of Two Systems," IEEE Trans. Dependable Secur. Comput., vol. 1, no. 1, pp. 87–96, 2004.
B. Schroeder and G. A. Gibson, "Understanding failures in petascale computers," Journal of Physics: Conference Series, vol. 78, no. 1, 2007.
E. B. Nightingale, J. R. Douceur, and V. Orgovan, "Cycles, cells and platters: An empirical analysis of hardware failures on a million consumer PCs," in Sixth conference on Computer systems (EuroSys), 2011, pp. 343–356.
S. Burger, O. Hummel, and M. Heinisch, "Airbus Cabin Software," IEEE Software, vol. 30, no. 1, pp. 21–25, 2013.
M. W. Winter, Software Fault Tree Analysis of an Automated Control System Device Written in Ada. Naval Postgraduate School, 1995.

Analytical Evaluation

D. A. Menasce and V. A. F. Almeida, Capacity Planning for Web Services: Metrics, Models, and Methods. Prentice Hall, 2002.
A. Sathaye, S. Ramani, and K. S. Trivedi, "Availability Models in Practice."
J. F. Meyer, "On Evaluating the Performability of Degradable Computing Systems," IEEE Trans. Comput., vol. 29, no. 8, pp. 720–731, Aug. 1980.
"Fehlermöglichkeits- und Einflussanalyse (FMEA) nach QS-9000." [Link]
W. E. Vesely, F. F. Goldberg, N. H. Roberts, and D. F. Haasl, Fault Tree Handbook. Washington, D.C.: US Nuclear Regulatory Commission, 1981.
D. W. Vesely, D. J. Dugan, J. Fragole, J. Minarik II, and J. Railsback, "Fault tree handbook with aerospace applications," NASA Office of Safety and Mission Assurance, NASA Headquarters, Washington DC, vol. 20546, 2002.
NASA Scientific Program Information, "Fault Tree Analysis," National Aeronautics and Space Administration, Jul. 2000.
M. Malhotra and K. S. Trivedi, "Dependability modeling using Petri-nets," Reliability, IEEE Transactions on, vol. 44, pp. 428–440, Sep. 1995.
J. Bechta Dugan, S. J. Bavuso, and M. A. Boyd, "Fault trees and sequence dependencies," in Reliability and Maintainability Symposium (RAMS), pp. 286–293.
R. La Band and J. D. Andrews, "Phased mission modelling using fault tree analysis," in Proceedings of the Institution of Mechanical Engineers, 2004.
N. Limnios, Fault Trees, vol. 1. Great Britain: ISTE, 2007.
M. Malhotra and K. S. Trivedi, "Power-hierarchy of dependability-model types," IEEE Transactions on Reliability, vol. 43, no. 3, pp. 493–502, Sep. 1994.

Failure Prediction

F. Salfner, P. Tröger, and S. Tschirpke, "Cross-Core Event Monitoring For Processor Failure Prediction," in The International Conference on High Performance Computing & Simulation, Workshop on Dependable Multi-Core Computing (DMCC), 2009, pp. 67–73.
J. Carreira, H. Madeira, and J. G. Silva, "Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers," IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 125–136, 1998.
H. Ziade, R. A. Ayoubi, and R. Velazco, "A Survey on Fault Injection Techniques," Int. Arab J. Inf. Technol., vol. 1, no. 2, pp. 171–186, 2004.
M. R. Lyu, Ed., Handbook of Software Reliability Engineering. McGraw-Hill, 1996.
IEEE, "IEEE Std 610.12-1990." IEEE Standards Board, New York, USA, 28-Sep-1990.
K. Vaidyanathan and K. S. Trivedi, "A comprehensive model for software rejuvenation," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 124–137, Jun. 2005.
L. Sha, "Using Simplicity to Control Complexity," IEEE Software, vol. 18, no. 4, pp. 20–28, 2001.
A. Avižienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Transactions on Software Engineering, vol. SE-11, no. 12, pp. 1491–1501, Dec. 1985.
G. Candea, J. Cutler, and A. Fox, "Improving Availability with Recursive Microreboots: A Soft-State System Case Study," Performance Evaluation Journal, vol. 56, no. 1–3, Mar. 2004.
A. Wood, "Software Reliability Growth Models," Tandem Computers, Cupertino, CA, Sep. 1996.
J. D. Musa and K. Okumoto, "Application of Basic and Logarithmic Poisson Execution Time Models in Software Reliability Measurement," in Software Reliability Modelling and Identification, 1988, pp. 68–100.
A. Benso, S. Chiusano, P. Prinetto, and L. Tagliaferri, "A C/C++ source-to-source compiler for dependable applications," in Dependable Systems and Networks (DSN), 2000.
T. J. Shimeall and N. G. Leveson, "An empirical comparison of software fault tolerance and fault elimination," in Second Workshop on Software Testing, Verification, and Analysis, 1988.
D. Hoffman, "A Taxonomy for Test Oracles." Quality Week 1998, 30-Mar-1998.
J. D. Musa, "Software reliability-engineered testing," Computer, vol. 29, no. 11, Nov. 1996.

Distributed Systems

L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558–565, 1978.
J. Gray, "Why do computers stop and what can be done about it?," in Symposium on Reliability in Distributed Software and Database Systems (SRDS-5), 1986, pp. 3–12.
P. Jalote, Fault Tolerance in Distributed Systems. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1994.
L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," Transactions on Programming Languages and Systems (TOPLAS), vol. 3, pp. 382–401, 1982.
M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of distributed consensus with one faulty process," J. ACM, vol. 32, no. 2, pp. 374–382, 1985.
J.-C. Laprie and K. Kanoun, "Software Reliability and System Reliability," in Handbook of software reliability engineering, M. R. Lyu, Ed. McGraw-Hill, 1996, pp. 27–69.
J. P. Hansen and D. P. Siewiorek, "Models for time coalescence in event logs," in International Symposium on Fault-Tolerant Computing (FTCS-22), 1992, pp. 221–227.
P. Lorczak, A. Caglayan, and D. Eckhardt, "A theoretical investigation of generalised voters," in IEEE 19th Annual International Symposium on Fault-Tolerant Computing Systems (FTCS’19), pp. 444–451.
K. M. Chandy and L. Lamport, "Distributed snapshots: determining global states of distributed systems," ACM Trans. Comput. Syst., vol. 3, no. 1, pp. 63–75, 1985.
G. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems - Concepts and Design, 4th ed. Addison Wesley, 2005.
A. S. Tanenbaum and M. van Steen, Distributed Systems - Principles and Paradigms. Prentice-Hall Inc., 2002.
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon’s highly available key-value store," SIGOPS Oper. Syst. Rev., vol. 41, no. 6, pp. 205–220, Oct. 2007.
S. C. Kendall, J. Waldo, A. Wollrath, and G. Wyant, "A Note on Distributed Computing," Sun Microsystems, Inc., Mountain View, CA, USA, 1994.
F. Cristian, "Understanding Fault-Tolerant Distributed Systems," Communications of the ACM, vol. 34, no. 2, pp. 56–78, 1991.