Non-Uniform Memory Access (NUMA) Seminar (2014)

Prof. Dr. Andreas Polze

Felix Eberhardt, M.Sc.
Frank Feinbube, M.Sc.

Organization

Extent: 2 semester hours (3 graded credit points)

Dates: Wednesday, 11.00 - 12.30, HS3

The seminar focuses on literature review. Participants are required to read fundamental scientific publications on non-uniform memory access (NUMA) systems and present them to their fellow students.
  • Each participant is expected to give a 30-45 minute presentation on a topic.
  • Presentation slides should be discussed with a supervisor one week prior to the presentation date.
  • At the end of the seminar, we plan to assemble a technical report about your seminar topics.

Topics

1. NUMA system architecture

1.1 Multiprocessor architectures (historic, current): AMD, Intel, IBM, Sparc
1.2 Interconnection technologies
1.3 Cache coherency

2. Operating systems

2.1 Scientific approaches: Thread and data placement
2.2 Topology discovery
2.3 Kernel APIs for thread and data placement

3. Programming models

3.1 NUMA-aware algorithms
3.2 OpenMP
3.3 OpenMPI
3.4 NUMA-aware hybrid computing with OpenCL/CUDA
3.5 NUMA support in high-level programming languages (Java, Python, C#, ...)
3.6 PGAS (Unified Parallel C, Coarray Fortran, Fortress, Chapel, X10, and Global Arrays)
3.7 C++11 memory consistency protocol

4. Profiling

4.1 Performance counters
4.2 Instruction-based sampling
4.3 Scientific approaches: Profilers/analyzing runtime behaviour
4.4 Profiling tools: Intel, AMD, IBM, Sun/Sparc

5. Case study: Linux NUMA balance evolution

We are open to any topic suggestions. Each of the proposed topics above may be worked on by one or two students.

Presentation Dates

Based on your prioritized topic lists, the seminar schedule is as follows:

Date | Topic | Presenter | Topic discussion | Presentation review
15.10.2014 | Introduction | Felix Eberhardt | - | -
22.10.2014 | Topic assignment | Felix Eberhardt | - | -
29.10.2014 | FutureSOC Lab Day | - | - | -
05.11.2014 | tba | Alexander Böhm (SAP) | - | -
12.11.2014 | Introduction to FutureSOC Lab | Felix Eberhardt | - | -
19.11.2014 | Multiprocessor architectures | Kirstin Heidler | 05.11.2014 | 12.11.2014
19.11.2014 | Cache coherency | Johannes Frohnhofen | 05.11.2014 | 12.11.2014
26.11.2014 | Interconnection technologies | Elina Zarisheva | 12.11.2014 | 24.11.2014
26.11.2014 | Scientific approaches: Thread and data placement | Fabian Eckert | 12.11.2014 | 24.11.2014
03.12.2014 | Case study: Linux NUMA evolution | Fredrik Teschke, Lukas Pirl | 06.11.2014 | 26.11.2014
03.12.2014 | Kernel APIs for thread and data placement | Dimitri Korsch | 19.11.2014 | 26.11.2014
10.12.2014 | Scientific approaches: NUMA Profilers/analyzing runtime behaviour | Malte Swart | 24.11.2014 | 04.12.2014
10.12.2014 | Performance Counter | Karsten Tausche | 24.11.2014 | 04.12.2014
17.12.2014 | Topology discovery | Sven Knebel | 01.12.2014 | 10.12.2014
17.12.2014 | NUMA in high level programming languages | Patrick Siegler | 01.12.2014 | 10.12.2014
07.01.2015 | NUMA with OpenCL | Jan Philipp Sachse | 25.11.2014 | 16.12.2014
07.01.2015 | C++11: Memory consistency | Sebastian Gerstenberg | 17.12.2014 | 28.12.2014
14.01.2015 | NUMA with OpenMP | Matthias Springer | 17.12.2014 | 12.01.2015
14.01.2015 | NUMA with OpenMPI | Carolin Fiedler | 17.12.2014 | 12.01.2015
21.01.2015 | NUMA-aware algorithms (matrix multiplication) | Max Reimann, Philipp Otto | 24.11.2014 | 19.01.2015
28.01.2015 | NUMA-aware algorithms (reader-writer locks) | Tom Herold, Marco Lamina | 03.12.2014 | 21.01.2015
04.02.2015 | NUMA-aware algorithms (SURF) | Christoph Sterz, Patrick Schmidt | 19.12.2014 | 02.02.2015

Start Literature

  • Molka, Daniel, et al. "Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system." Parallel Architectures and Compilation Techniques, 2009. PACT'09. 18th International Conference on. IEEE, 2009.
  • Majo, Zoltan, and Thomas R. Gross. "Memory System Performance in a NUMA Multicore Multiprocessor." (2011).
  • Antony, Joseph, Pete P. Janes, and Alistair P. Rendell. "Exploring thread and memory placement on NUMA architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport." High Performance Computing-HiPC 2006. Springer Berlin Heidelberg, 2006. 338-352.
  • LaRowe Jr, Richard P., and Carla Schlatter Ellis. "Experimental comparison of memory management policies for NUMA multiprocessors." ACM Transactions on Computer Systems (TOCS) 9.4 (1991): 319-363.
  • Su, ChunYi, et al. "Critical path-based thread placement for NUMA systems." ACM SIGMETRICS Performance Evaluation Review 40.2 (2012): 106-112.
  • Fowler, Rob, Anirban Mandal, and Min Yeol Lim. "Performance Consistency on Multi-socket AMD Opteron Systems." (2008).
  • Löf, Henrik, and Sverker Holmgren. "Affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system." Proceedings of the 19th annual international conference on Supercomputing. ACM, 2005.
  • Blagodurov, Sergey, et al. "A case for NUMA-aware contention management on multicore systems." Proceedings of the 19th international conference on Parallel architectures and compilation techniques. ACM, 2010.
  • Majo, Zoltan, and Thomas R. Gross. "Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead." ACM SIGPLAN Notices. Vol. 46. No. 11. ACM, 2011.
  • Dashti, Mohammad, et al. "Traffic management: a holistic approach to memory placement on NUMA systems." ACM SIGPLAN Notices 48.4 (2013): 381-394.
  • Li, Yinan, et al. "NUMA-aware algorithms: the case of data shuffling." CIDR. 2013.
  • Majo, Zoltan, and Thomas R. Gross. "A template library to integrate thread scheduling and locality management for NUMA multiprocessors." Proc. of the 4th USENIX conference on Hot Topics in Parallelism (HotPar). 2012.
  • Broquedis, François, et al. "ForestGOMP: an efficient OpenMP environment for NUMA architectures." International Journal of Parallel Programming 38.5-6 (2010): 418-439.
  • Broquedis, François, et al. "Dynamic task and data placement over NUMA architectures: an OpenMP runtime perspective." Evolving OpenMP in an Age of Extreme Parallelism. Springer Berlin Heidelberg, 2009. 79-92.
  • Olivier, Stephen L., et al. "OpenMP task scheduling strategies for multicore NUMA systems." International Journal of High Performance Computing Applications (2012): 1094342011434065.
  • Durand, Marie, et al. "An efficient OpenMP loop scheduler for irregular applications on large-scale NUMA machines." OpenMP in the Era of Low Power Devices and Accelerators. Springer Berlin Heidelberg, 2013. 141-155.
  • Li, Shigang, Torsten Hoefler, and Marc Snir. "NUMA-aware shared-memory collective communication for MPI." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.
  • Marathe, Jaydeep, and Frank Mueller. "Hardware profile-guided automatic page placement for ccNUMA systems." Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, 2006.
  • Zaparanuks, Dmitrijs, Milan Jovic, and Matthias Hauswirth. "Accuracy of performance counter measurements." Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on. IEEE, 2009.
  • McCurdy, Collin, and Jeffrey Vetter. "Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms." Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 2010.
  • Lachaize, Renaud, Baptiste Lepers, and Vivien Quéma. "MemProf: A Memory Profiler for NUMA Multicore Systems." USENIX Annual Technical Conference. 2012.
  • http://thread.gmane.org/gmane.linux.kernel/1699668
  • http://lwn.net/Articles/486858/