Software Reliability Engineering 2019

Prof. Dr. Andreas Polze

Daniel Richter
Lukas Pirl
Lena Feinbube

Course Content

A software system is reliable if it delivers its service in a trustworthy manner. Software reliability is becoming increasingly important as software becomes ubiquitous in our lives and, at the same time, ever more complex. Modern software systems grow not only in size but also in complexity, through additional abstraction layers, interaction among diverse components, concurrency, and other sources of non-determinism.

This lecture presents the state of the art in software reliability, with a strong practical focus on case studies and realistic, large-scale software systems. Topics covered include:

  • Fundamentals
    • Dependability Threats, Dependability Attributes, Dependability Means
    • Software Fault, Timing, and Consistency Models
  • Fault Prevention
    • Processes for dependable software design
    • Development practices
  • Fault Tolerance
    • Patterns for fault tolerance and detection
    • Distributed systems: theory and applications
    • Fault tolerance in operating systems
  • Fault Removal
    • Formal methods
    • Testing and Debugging
  • Fault Forecasting
    • Fault injection
    • Dependability modelling and analysis
  • Discussion of Case Studies and Postmortems from Practice

We spend a large part of the time on hands-on project work in small teams. As part of a programming project, an application is made fault-tolerant and then evaluated by means of fault injection. The project work is designed as a competition and runs throughout the entire semester.

Organization

Scope: 4 semester hours per week (6 graded ECTS credits)

Lecture/Project: Wednesdays, 11:00-12:30 in A-2.2; Thursdays, 13:30-15:00 in H-2.58

Module (SO2010): OSIS, SAMT, ISAE

Assessment: The lecture is accompanied by an exercise/project; successful participation in it is a prerequisite for admission to the exam. Assessment takes the form of an oral exam.

Exercises

Group A: Oliver, Leonard, Alexander, Pawel

https://sre18.pages.rechenknecht.net/smtp-server/pack/Dockerfile,
https://hub.docker.com/r/sre18groupa/smtp-server-group-a/

Group B: Johannes, Tobias, Tobias, Nils

https://owncloud.hpi.de/index.php/s/9uitNpBSCxaOLCt/download,
https://hub.docker.com/r/sregroupb/smtp-server/

Group C: Lawrence, Jan, Martin, Fabian

https://github.com/lawben/sre-smtp,
https://github.com/lawben/sre-smtp/blob/master/Dockerfile,
https://hub.docker.com/r/sregroupc/smtp-server/

Group D: Theresia, Patrick, Jonas

https://gitlab.hpi.de/patrick.jattke/sw-reliability-eng/raw/master/docker/dockerfile,
https://hub.docker.com/r/sregroupd/sre-smtp-group-d/

Group E: Johannes, Julian, Tobias

SMTP-Service
Storage-Service
Fault-Injection-Test-Suite
https://hub.docker.com/r/julianweise/simplesmtp/

Schedule & Materials

Wed, 11 April 2018 Introductory session (11:00, room A-2.2) »Slides
Thu, 12 April 2018 no class
Wed, 18 April 2018 Symposium on Future Trends in Service-Oriented Computing
Thu, 19 April 2018 Symposium on Future Trends in Service-Oriented Computing
Wed, 25 April 2018 Dependability Fundamentals, assignment of project groups »Slides
Thu, 26 April 2018 Project work
Wed, 02 May 2018 Presentation of Exercise 1 »Group A, »Group B, »Group C, »Group D, »Group E
Thu, 03 May 2018 Project work
Wed, 09 May 2018 Fault Prevention, Fault Injection »Slides, »Slides
Thu, 10 May 2018 Ascension Day
Wed, 16 May 2018 Presentation of Exercises 2+3 »Group A, »Group B, »Group C, »Group D, »Group E
Thu, 17 May 2018 Project work
Wed, 23 May 2018 Fault injection (tools) »Slides
Thu, 24 May 2018 Project work
Wed, 30 May 2018 Fault Tolerance »Slides
Thu, 31 May 2018 Project work
Wed, 06 June 2018 Presentation of Exercises 4+5 »Group A, »Group B, »Group C, »Group D, »Group E
Thu, 07 June 2018 Project work
Wed, 13 June 2018 Fault Tolerance »Slides, »Slides, »Slides
Thu, 14 June 2018 Project work
Wed, 20 June 2018 Presentation of Exercise 7 »Group A, »Group B, »Group C, »Group D, »Group E
Thu, 21 June 2018 Project work
Wed, 27 June 2018 Fault Tolerance in Distributed Systems »Slides
Thu, 28 June 2018 Project work
Wed, 04 July 2018 Fault Removal, Human Factors, Site Reliability Engineering »Slides, »Slides, »Slides
Thu, 05 July 2018 Fault Forecasting, Software Fault Injection Case Study on OpenStack »Slides, »Slides (short, long)
Wed, 11 July 2018 Final presentation/presentation of Exercises 8+9 »Group A, »Group B, »Group C, »Group D, »Group E
Thu, 12 July 2018 Q&A session
Wed, 18 July 2018 QRS
Thu, 19 July 2018 QRS

Literature

  • Nancy Leveson. Engineering a safer world - systems thinking applied to safety. 2011. URL: https://library.oapen.org/handle/20.500.12657/26043.
  • Andrey Karpov and Evgeniy Ryzhkov. 100 bugs in open source C/C++ projects. 2012. URL: http://www.viva64.com/en/a/0079/#.
  • George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Microreboot-a technique for cheap recovery. In OSDI, volume 4, 31–44. 2004.
  • Ricky W Butler and George B Finelli. The infeasibility of quantifying the reliability of life-critical real-time software. IEEE Transactions on Software Engineering, 19(1):3–12, 1993.
  • Michael Nygard. Release it!: design and deploy production-ready software. Pragmatic Bookshelf, 2007.
  • John C Knight. Safety critical systems: challenges and directions. In Software Engineering, 2002. ICSE 2002. Proceedings of the 24th International Conference on, 547–550. IEEE, 2002.
  • Jim Gray. Why do computers stop and what can be done about it? In Symposium on reliability in distributed software and database systems, 3–12. Los Angeles, CA, USA, 1986. URL: http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf.
  • Algirdas Avižienis, Jean-Claude Laprie, Brian Randell, and others. Fundamental concepts of dependability. University of Newcastle upon Tyne, Computing Science Newcastle upon Tyne, UK, 2001. URL: https://pld.ttu.ee/IAF0530/16/avi1.pdf.
  • Peter Tröger, Lena Feinbube, and Matthias Werner. WAP: What activates a bug? A refinement of the Laprie terminology model. In Software Reliability Engineering (ISSRE), 2015 IEEE 26th International Symposium on, 106–111. IEEE, 2015. URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7381804.
  • Algirdas Avižienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic concepts and taxonomy of dependable and secure computing. Dependable and Secure Computing, IEEE Transactions on, 1(1):11–33, 2004. URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1335465.
  • Algirdas Avižienis, Jean-Claude Laprie, and Brian Randell. Dependability and its threats: a taxonomy. In Building the Information Society, pages 91–120. Springer, 2004. URL: http://link.springer.com/chapter/10.1007/978-1-4020-8157-6_13.
  • Matthias Wiesmann, Fernando Pedone, André Schiper, Bettina Kemme, and Gustavo Alonso. Database replication techniques: A three parameter classification. In Reliable Distributed Systems, 2000. SRDS-2000. Proceedings The 19th IEEE Symposium on, 206–215. IEEE, 2000. URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=885408.
  • Joel F. Bartlett. A NonStop Kernel. In Proceedings of the Eighth ACM Symposium on Operating Systems Principles, SOSP '81, 22–29. New York, NY, USA, 1981. ACM. URL: http://doi.acm.org/10.1145/800216.806587, doi:10.1145/800216.806587.
  • Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U Jain, and Michael Stumm. Simple testing can prevent most critical failures: an analysis of production failures in distributed data-intensive systems. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 249–265. 2014.
  • William E Vesely, Francine F Goldberg, Norman H Roberts, and David F Haasl. Fault tree handbook. Technical Report, DTIC Document, 1981.
  • Roberto Natella, Domenico Cotroneo, and Henrique S Madeira. Assessing dependability with software fault injection: a survey. ACM Computing Surveys (CSUR), 48(3):44, 2016.
  • Ram Chillarege. Orthogonal defect classification. Handbook of Software Reliability Engineering, pages 359–399, 1996.
  • Frederick P Brooks Jr. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition, 2/E. Pearson Education India, 1995.
  • Robert Hanmer. Patterns for fault tolerant software. John Wiley & Sons, 2013.
  • Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, and others. seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, 207–220. ACM, 2009. URL: http://dl.acm.org/citation.cfm?id=1629596.
  • Algirdas Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, SE-11(12):1491–1501, 1985. URL: http://www.computer.org/csdl/trans/ts/1985/12/01701972-abs.html.
  • Flaviu Cristian. Exception handling and software fault tolerance. Computers, IEEE Transactions on, C-31(6):531–540, 1982. URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1676035.
  • Werner Vogels. Eventually consistent. Communications of the ACM, 52(1):40–44, 2009. URL: http://dl.acm.org/citation.cfm?id=1435432.
  • Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, 205–220. ACM, 2007. URL: http://dl.acm.org/citation.cfm?id=1294281.
  • Eric Brewer and others. Lessons from giant-scale services. Internet Computing, IEEE, 5(4):46–55, 2001. URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=939450.
  • Michael Grottke, Rivalino Matias, and Kishor S Trivedi. The fundamentals of software aging. In IEEE Proceedings of Workshop on Software Aging and Rejuvenation, in conjunction with ISSRE. Seattle, WA. 2008.