Non-Uniform Memory Access (NUMA) Seminar (2014)

Prof. Dr. Andreas Polze

Felix Eberhardt, M.Sc.
Frank Feinbube, M.Sc.

Organization

Extent: 2 semester hours (3 graded credit points)

Dates: Wednesday, 11.00 - 12.30, HS3

The seminar focuses on literature review. Participants are required to read fundamental scientific publications on non-uniform memory access (NUMA) systems and present them to their fellow students.

Each participant is expected to give a 30-45 minute presentation on a topic.
Presentation slides should be discussed with a supervisor one week prior to the presentation date.
At the end of the seminar, we plan to assemble a technical report about your seminar topics.

Topics

1. NUMA system architecture

1.1 Multiprocessor architectures (historic, current): AMD, Intel, IBM, SPARC
1.2 Interconnection technologies
1.3 Cache coherency

2. Operating systems

2.1 Scientific approaches: Thread and data placement
2.2 Topology discovery
2.3 Kernel APIs for thread and data placement

3. Programming models

3.1 NUMA-aware algorithms
3.2 OpenMP
3.3 OpenMPI
3.4 NUMA-aware hybrid computing with OpenCL/CUDA
3.5 NUMA support in high level programming languages (Java, Python, C#, ...)
3.6 PGAS (Unified Parallel C, Coarray Fortran, Fortress, Chapel, X10, and Global Arrays)
3.7 C++11 memory consistency model

4. Profiling

4.1 Performance counters
4.2 Instruction-based sampling
4.3 Scientific approaches: Profilers/analyzing runtime behaviour
4.4 Profiling tools: Intel, AMD, IBM, Sun/SPARC

5. Case study: Linux NUMA balance evolution

We are open to topic suggestions. Each of the topics proposed above may be worked on by one or two students.

Presentation Dates

Based on your prioritized topic lists, the seminar schedule is as follows:

Date	Topic	Presenter	Topic discussion	Presentation review
15.10.2014	Introduction	Felix Eberhardt	-	-
22.10.2014	Topic assignment	Felix Eberhardt	-	-
29.10.2014	FutureSOC Lab Day	-	-	-
05.11.2014	tba	Alexander Böhm (SAP)	-	-
12.11.2014	Introduction to FutureSOC Lab	Felix Eberhardt	-	-
19.11.2014	Multiprocessor architectures	Kirstin Heidler	05.11.2014	12.11.2014
	Cache coherency	Johannes Frohnhofen	05.11.2014	12.11.2014
26.11.2014	Interconnection technologies	Elina Zarisheva	12.11.2014	24.11.2014
	Scientific approaches: Thread and data placement	Fabian Eckert	12.11.2014	24.11.2014
03.12.2014	Case study: Linux NUMA evolution	Fredrik Teschke, Lukas Pirl	06.11.2014	26.11.2014
	Kernel APIs for thread and data placement	Dimitri Korsch	19.11.2014	26.11.2014
10.12.2014	Scientific approaches: NUMA Profilers/analyzing runtime behaviour	Malte Swart	24.11.2014	04.12.2014
	Performance Counter	Karsten Tausche	24.11.2014	04.12.2014
17.12.2014	Topology discovery	Sven Knebel	01.12.2014	10.12.2014
	NUMA in high level programming languages	Patrick Siegler	01.12.2014	10.12.2014
07.01.2015	NUMA with OpenCL	Jan Philipp Sachse	25.11.2014	16.12.2014
	C++11: Memory consistency	Sebastian Gerstenberg	17.12.2014	28.12.2014
14.01.2015	NUMA with OpenMP	Matthias Springer	17.12.2014	12.01.2015
	NUMA with OpenMPI	Carolin Fiedler	17.12.2014	12.01.2015
21.01.2015	NUMA-aware algorithms (matrix multiplication)	Max Reimann, Philipp Otto	24.11.2014	19.01.2015
28.01.2015	NUMA-aware algorithms (reader-writer locks)	Tom Herold, Marco Lamina	03.12.2014	21.01.2015
04.02.2015	NUMA-aware algorithms (SURF)	Christoph Sterz, Patrick Schmidt	19.12.2014	02.02.2015

Starting Literature

Molka, Daniel, et al. "Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system." Parallel Architectures and Compilation Techniques, 2009. PACT'09. 18th International Conference on. IEEE, 2009.
Majo, Zoltan, and Thomas R. Gross. "Memory System Performance in a NUMA Multicore Multiprocessor." (2011).
Antony, Joseph, Pete P. Janes, and Alistair P. Rendell. "Exploring thread and memory placement on NUMA architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport." High Performance Computing-HiPC 2006. Springer Berlin Heidelberg, 2006. 338-352.
LaRowe Jr, Richard P., and Carla Schlatter Ellis. "Experimental comparison of memory management policies for NUMA multiprocessors." ACM Transactions on Computer Systems (TOCS) 9.4 (1991): 319-363.
Su, ChunYi, et al. "Critical path-based thread placement for NUMA systems." ACM SIGMETRICS Performance Evaluation Review 40.2 (2012): 106-112.
Fowler, Rob, Anirban Mandal, and Min Yeol Lim. "Performance Consistency on Multi-socket AMD Opteron Systems." (2008).
Löf, Henrik, and Sverker Holmgren. "Affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system." Proceedings of the 19th annual international conference on Supercomputing. ACM, 2005.
Blagodurov, Sergey, et al. "A case for NUMA-aware contention management on multicore systems." Proceedings of the 19th international conference on Parallel architectures and compilation techniques. ACM, 2010.
Majo, Zoltan, and Thomas R. Gross. "Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead." ACM SIGPLAN Notices. Vol. 46. No. 11. ACM, 2011.
Dashti, Mohammad, et al. "Traffic management: a holistic approach to memory placement on NUMA systems." ACM SIGPLAN Notices 48.4 (2013): 381-394.
Li, Yinan, et al. "NUMA-aware algorithms: the case of data shuffling." CIDR. 2013.
Majo, Zoltan, and Thomas R. Gross. "A template library to integrate thread scheduling and locality management for NUMA multiprocessors." Proc. of the 4th USENIX conference on Hot Topics in Parallelism (HotPar). 2012.
Broquedis, François, et al. "ForestGOMP: an efficient OpenMP environment for NUMA architectures." International Journal of Parallel Programming 38.5-6 (2010): 418-439.
Broquedis, François, et al. "Dynamic task and data placement over NUMA architectures: an OpenMP runtime perspective." Evolving OpenMP in an Age of Extreme Parallelism. Springer Berlin Heidelberg, 2009. 79-92.
Olivier, Stephen L., et al. "OpenMP task scheduling strategies for multicore NUMA systems." International Journal of High Performance Computing Applications (2012): 1094342011434065.
Durand, Marie, et al. "An efficient OpenMP loop scheduler for irregular applications on large-scale NUMA machines." OpenMP in the Era of Low Power Devices and Accelerators. Springer Berlin Heidelberg, 2013. 141-155.
Li, Shigang, Torsten Hoefler, and Marc Snir. "NUMA-aware shared-memory collective communication for MPI." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.
Marathe, Jaydeep, and Frank Mueller. "Hardware profile-guided automatic page placement for ccNUMA systems." Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, 2006.
Zaparanuks, Dmitrijs, Milan Jovic, and Matthias Hauswirth. "Accuracy of performance counter measurements." Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on. IEEE, 2009.
McCurdy, Collin, and Jeffrey Vetter. "Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms." Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 2010.
Lachaize, Renaud, Baptiste Lepers, and Vivien Quéma. "MemProf: A Memory Profiler for NUMA Multicore Systems." USENIX Annual Technical Conference. 2012.
http://thread.gmane.org/gmane.linux.kernel/1699668
http://lwn.net/Articles/486858/