Large-Scale Data Analysis on Cloud Platforms (2010)

Prof. Dr. Felix Naumann, Christoph Böhm, Johannes Lore, Dr. Peter Tröger

Summer Semester 2010


The seminar is a collaborative effort of the Information Systems group and the Operating Systems and Middleware group at HPI. The primary course page contains all relevant information on the course regulations and the general concept.

Participating students choose either a data analysis topic (as described here) or a cloud infrastructure topic. While the data analysis teams rely on a predefined infrastructure (Amazon Cloud / Hadoop), the infrastructure teams are tasked with evaluating alternative cloud environments. Possible topics are:

  • VMware
    • Utilize vSphere 4 / vCloud for a private cloud installation
    • Configure / test Hadoop on this infrastructure
  • Eucalyptus
    • Private cloud infrastructure based on Xen / KVM
    • Offers Amazon EC2 interfaces
    • Evaluate Map/Reduce options
  • Lokad
    • Running MapReduce on Azure
  • Nimbus
    • Private cloud based on SGE / PBS cluster
    • Offers Amazon EC2 interface
    • Evaluate Map/Reduce options
  • OpenNebula
    • Open source solution for private clouds
    • Support for OCCI
    • Evaluate Map/Reduce options
  • AppScale
    • Run Google App Engine applications on a private cloud of Xen / KVM installations
    • Evaluate Map/Reduce options with Google PaaS, e.g. HTTP Map/Reduce project
  • Cloud MapReduce
    • Alternative Map/Reduce implementation for Amazon Cloud
    • Direct comparison with standard solution, including custom benchmark design

Tasks for the infrastructure teams:


  • Decide on a topic by April 26th and notify us by email
  • Learn about and install the cloud environment
  • Configure Map/Reduce execution environment
  • Document architectural properties
  • Analyze fault-tolerance capabilities
  • (Implement and) run Hadoop MapReduce sorting benchmark
  • Optional: Collaboration with one data analysis team
  • Submit a comprehensive report on the architectural properties and analysis results
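
For orientation, the idea behind a MapReduce sorting benchmark can be sketched in a few lines. The following is a minimal single-process simulation, not Hadoop code: it assumes an identity map step, a range partitioner whose split points come from a random sample of the keys, and an independent sort per partition. The function names `sample_split_points` and `mapreduce_sort` are hypothetical and chosen for illustration only.

```python
import bisect
import random


def sample_split_points(keys, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 ascending split points from a random key sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def mapreduce_sort(keys, num_partitions=4):
    """Sort keys in the style of a MapReduce sort job, within one process."""
    if not keys:
        return []
    splits = sample_split_points(keys, num_partitions)

    # Map + partition: the map step is the identity; each key is routed
    # to the partition that owns its key range.
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[bisect.bisect_right(splits, key)].append(key)

    # Shuffle/sort + reduce: every partition is sorted independently and
    # the identity reduce step emits the keys in that order.
    outputs = [sorted(part) for part in partitions]

    # Partitions cover disjoint, ascending key ranges, so concatenating
    # them yields a globally sorted result.
    return [key for part in outputs for key in part]


if __name__ == "__main__":
    data = [random.randint(0, 10_000) for _ in range(1_000)]
    print(mapreduce_sort(data) == sorted(data))
```

On a real cluster the partitions would be processed on different nodes, so the quality of the split points determines how evenly the load is spread. This is one of the properties a benchmark run can expose.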