





































| CM2 – Connection Machine |  |
|---|---|
| Reference: | W. Daniel Hillis: The Connection Machine. 1985 (MIT Press Series in Artificial Intelligence), ISBN 0-262-08157-1 |
| Figure: | CM2 at Computer Museum, Mountain View, CA |
| Manufacturer: | Thinking Machines Corporation, Cambridge, Massachusetts |
| Processors: | 65,536 PEs (1-bit processors)<br>Memory per PE: 128 KB (maximum)<br>Peak performance: 2,500 MIPS (32-bit op.)<br>10,000 MFLOPS (scalar, 32-bit)<br>5,000 MFLOPS (scalar, 64-bit) |
| Interconnection networks: | - global hypercube<br>- 4-way reconfigurable nearest-neighbor grid |
| Programming languages: | - CMLisp (original variant)<br>- \*Lisp (Common Lisp extension)<br>- C\* (extension of C)<br>- CMFortran (modeled on Fortran 90)<br>- C/Paris (C + assembler library routines) |

| MasPar MP-1 |  |
|---|---|
| Manufacturer: | MasPar Computer Corporation, Sunnyvale, California |
| Processors: | 16,384 PEs (4-bit processors)<br>Memory per PE: 64 KB (maximum)<br>Peak performance: 30,000 MIPS (32-bit op.)<br>1,500 MFLOPS (32-bit)<br>600 MFLOPS (64-bit) |
| Interconnection networks: | - 3-stage global crossbar switch (router)<br>- 8-way nearest-neighbor grid (independent) |
| Programming languages: | - MPL (extension of C)<br>- MPFortran (modeled on Fortran 90) |



| Distributed Array Processor (DAP 610) |  |
|---|---|
| Background: | The Distributed Array Processor (DAP) produced by **International Computers Limited** (ICL) was the world's first commercial massively parallel computer. The original paper study was complete in 1972 and building of the prototype began in 1974.<br>The ICL DAP had 64x64 single-bit processing elements (PEs) with 4096 bits of storage per PE. It was attached to an ICL mainframe and could be used as normal memory (from Wikipedia).<br>Early mainframe coprocessor. |
| Manufacturer: | Active Memory Technology (AMT), Reading, England |
| Processors: | 4,096 PEs (1-bit processors + 8-bit coprocessors)<br>Memory per PE: 32 KB<br>Peak performance: 40,000 MIPS (1-bit op.)<br>20,000 MIPS (8-bit op.)<br>560 MFLOPS |
| Interconnection network: | - 4-way nearest-neighbor grid<br>- (no global network) |
| Programming language: | - Fortran-Plus (modeled on Fortran 90) |









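## **Data Distribution in HPF**

```
! Processor arrangements (abstract processor grids)
!HPF$ PROCESSORS :: prc(5), chess_board(8, 8)
!HPF$ PROCESSORS :: cnfg(-10:10, 5)
!HPF$ PROCESSORS :: mach( NUMBER_OF_PROCESSORS() )

REAL :: a(1000), b(1000)
INTEGER :: c(1000, 1000, 1000), d(1000, 1000, 1000)

! a and b: one contiguous block of 200 elements per processor of prc
!HPF$ DISTRIBUTE (BLOCK) ONTO prc :: a
!HPF$ DISTRIBUTE (BLOCK) ONTO prc :: b
! c: blocks of 100 in dim 1, dim 2 not distributed, dim 3 round-robin
!HPF$ DISTRIBUTE (BLOCK(100), *, CYCLIC) ONTO cnfg :: c
! Alignment: element c(i,j,k) is placed with d(k,j,i)
!HPF$ ALIGN (i,j,k) WITH d(k,j,i) :: c
```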
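
To make the BLOCK and CYCLIC mappings concrete, the following Python sketch computes which processor owns a given array element. It is only an illustration under stated assumptions: the helper names `block_owner` and `cyclic_owner` are hypothetical, elements are 1-based as in Fortran, and processors are reported as 0-based positions within the arrangement (so for `cnfg(-10:10, 5)` the position would still have to be offset by the lower bound).

```python
import math

def block_owner(i, n, p, block_size=None):
    """0-based processor position owning 1-based element i under BLOCK."""
    if block_size is None:
        block_size = math.ceil(n / p)  # default block size when none is given
    owner = (i - 1) // block_size
    if owner >= p:
        raise ValueError("block size too small for this processor count")
    return owner

def cyclic_owner(i, p):
    """0-based processor position owning 1-based element i under CYCLIC."""
    return (i - 1) % p

# a(1000) distributed BLOCK onto prc(5): blocks of 200 elements
print([block_owner(i, 1000, 5) for i in (1, 200, 201, 1000)])  # [0, 0, 1, 4]
# a dimension distributed CYCLIC onto 5 processors: round-robin assignment
print([cyclic_owner(k, 5) for k in (1, 2, 5, 6)])              # [0, 1, 4, 0]
```

BLOCK keeps neighboring elements on the same processor, which tends to suit stencil-like access, while CYCLIC spreads consecutive elements across processors, which tends to help load balance.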









## **OpenCL Execution Model**

OpenCL Program:

- Kernels
  - Basic unit of executable code, similar to a C function
  - Data-parallel or task-parallel
- Host Program
  - Collection of compute kernels and internal functions
  - Analogous to a dynamic library

Kernel Execution

- The host program invokes a kernel over an index space called an NDRange
  - NDRange = "N-Dimensional Range"
  - NDRange can be a 1, 2, or 3-dimensional space
- A single kernel instance at a point in the index space is called a work-item
  - Work-items have unique global IDs from the index space
- Work-items are further grouped into work-groups
  - Work-groups have a unique work-group ID
  - Work-items have a unique local ID within a work-group
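
The relationship between global IDs, work-group IDs, and local IDs can be sketched in plain Python. This is not OpenCL code, only a model of the index arithmetic. The function name `work_items` is hypothetical, and the sketch assumes no global offset and a global size that is a multiple of the work-group size in every dimension.

```python
from itertools import product

def work_items(global_size, local_size):
    """Yield (global_id, group_id, local_id) for every work-item."""
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError("global size must be a multiple of local size")
    for gid in product(*(range(g) for g in global_size)):
        # Per dimension: global_id = group_id * local_size + local_id
        group_id = tuple(g // l for g, l in zip(gid, local_size))
        local_id = tuple(g % l for g, l in zip(gid, local_size))
        yield gid, group_id, local_id

items = list(work_items((4, 4), (2, 2)))
print(len(items))  # 16 work-items in a 4 x 4 NDRange
print(items[6])    # ((1, 2), (0, 1), (1, 0))
```

In this 2D example the work-item with global ID (1, 2) belongs to work-group (0, 1) and has local ID (1, 0) inside it.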



## **Contexts and Queues**

Contexts are used to contain and manage the state of the "world"

- Kernels are executed in contexts defined and manipulated by the host
  - Devices
  - Kernels: OpenCL functions
  - Program objects: kernel source and executable
  - Memory objects
- Command-queue: coordinates execution of kernels
  - Kernel execution commands
  - Memory commands: transfer or mapping of memory object data
  - Synchronization commands: constrain the order of commands
- Applications queue compute kernel execution instances
  - Queued in-order
  - Executed in-order or out-of-order
  - Events are used to implement appropriate synchronization of execution instances
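
The difference between in-order and out-of-order execution, and the role of events, can be illustrated with a toy model in Python. This does not use the OpenCL API. The function `run_queue` and the command names are hypothetical, and each command simply lists the commands whose completion events it waits for.

```python
def run_queue(commands, in_order=True):
    """Return one valid execution order of the queued commands.

    commands: list of (name, wait_list) in enqueue order; wait_list names
    the commands whose completion events must have fired first.
    """
    done, order = set(), []
    pending = list(commands)
    while pending:
        if in_order:
            candidates = pending[:1]    # only the head of the queue may start
        else:
            candidates = pending[::-1]  # any ready command; newest first to show reordering
        for cmd in candidates:
            name, wait_list = cmd
            if all(w in done for w in wait_list):
                pending.remove(cmd)
                done.add(name)
                order.append(name)
                break
        else:
            raise RuntimeError("deadlock: no command is ready")
    return order

queue = [
    ("write_buffer", []),
    ("kernel_a", ["write_buffer"]),
    ("kernel_b", []),
    ("read_buffer", ["kernel_a", "kernel_b"]),
]
print(run_queue(queue, in_order=True))
# ['write_buffer', 'kernel_a', 'kernel_b', 'read_buffer']
print(run_queue(queue, in_order=False))
# ['kernel_b', 'write_buffer', 'kernel_a', 'read_buffer']
```

In both modes the commands are queued in the same order. In the out-of-order run, `kernel_b` may start before `write_buffer`, but the events still force `kernel_a` to wait for the write and `read_buffer` to wait for both kernels.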























07.01.2013



































ParProg | Hardware


## **Simple Queuing Management System**

Figure: User, Utility Program, Scheduler, Dispatcher, Load Balancer, Compute Node

- Utility Program: command-line tool for the user
- Scheduler: subsystem that services user requests
  - After the user submits a job, the scheduler queues the job in its queue
  - Makes decisions based on the scheduling policy
- Queue: collection of jobs, ordered based on attributes/policy
- Dispatcher: performs the submission of jobs in the queue
- Load Balancer: selects an appropriate set of compute nodes, based on monitoring

PT 2010
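
The interplay of scheduler, queue, dispatcher, and load balancer can be sketched as follows. This is a minimal model rather than the design of any real batch system. All class and function names are hypothetical, the scheduling policy is assumed to be "lowest priority number first, then first-come first-served", and the monitoring data is assumed to be a simple load count per node.

```python
import heapq

class Scheduler:
    """Queues jobs; a lower priority number is served first, then FIFO."""
    def __init__(self):
        self._queue, self._counter = [], 0

    def submit(self, job, priority, nodes_needed):
        # The counter breaks ties so equal priorities stay in arrival order.
        heapq.heappush(self._queue, (priority, self._counter, job, nodes_needed))
        self._counter += 1

    def next_job(self):
        _, _, job, nodes_needed = heapq.heappop(self._queue)
        return job, nodes_needed

    def __len__(self):
        return len(self._queue)

def load_balancer(load_by_node, nodes_needed):
    """Select the least-loaded nodes according to the monitoring data."""
    return sorted(load_by_node, key=load_by_node.get)[:nodes_needed]

def dispatch(scheduler, load_by_node):
    """Dispatcher: take jobs from the queue and place them on nodes."""
    placements = []
    while len(scheduler):
        job, nodes_needed = scheduler.next_job()
        nodes = load_balancer(load_by_node, nodes_needed)
        for node in nodes:
            load_by_node[node] += 1  # update the monitored load
        placements.append((job, nodes))
    return placements

sched = Scheduler()
sched.submit("render", priority=2, nodes_needed=1)
sched.submit("simulate", priority=1, nodes_needed=2)
print(dispatch(sched, {"n1": 0, "n2": 3, "n3": 1}))
# [('simulate', ['n1', 'n3']), ('render', ['n1'])]
```

Although `render` was submitted first, the policy moves `simulate` ahead of it, and the load balancer avoids the busy node `n2` in both placements.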

















|                         | MPP                                           | SMP                                             | Cluster                         | Distributed                                   |
|-------------------------|-----------------------------------------------|-------------------------------------------------|---------------------------------|-----------------------------------------------|
| Number of nodes         | O(100)-O(1000)                                | O(10)-O(100)                                    | O(100) or less                  | O(10)-O(1000)                                 |
| Node Complexity         | Fine grain                                    | Medium or coarse<br>grained                     | Medium grain                    | Wide range                                    |
| Internode communication | Message passing /<br>shared variables<br>(SM) | Centralized and<br>distributed shared<br>memory | Message Passing                 | Shared files, RPC,<br>Message Passing,<br>IPC |
| Job scheduling          | Single run queue on<br>host                   | Single run queue<br>mostly                      | Multiple queues but coordinated | Independent queue                             |
| SSI support             | Partially                                     | Always in SMP                                   | Desired                         | No                                            |
| Address Space           | Multiple                                      | Single                                          | Multiple or single              | Multiple                                      |
| Internode Security      | Irrelevant                                    | Irrelevant                                      | Required if exposed             | Required                                      |
| Ownership               | One organization                              | One organization                                | One or many organizations       | Many organizations                            |

































| Simulation of → (by ↓) | Grid (2D)     | PM2I           | Shuffle-Exchange | Hypercube      |
|------------------------|---------------|----------------|------------------|----------------|
| Grid (2D)              | –             | $\sqrt{n/2}$   | $\sqrt{n}$       | $\sqrt{n}$     |
| PM2I                   | $1$           | –              | $\log_2 n$       | $2$            |
| Shuffle-Exchange       | $2 \log_2 n$  | $2 \log_2 n$   | –                | $\log_2 n + 1$ |
| Hypercube              | $\log_2 n$    | $\log_2 n$     | $\log_2 n$       | –              |
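
The entry 1 for PM2I simulating the 2D grid can be checked directly: every grid link is already a PM2I link. The Python sketch below verifies this under stated assumptions, namely that the grid is a wrap-around grid on a linear node index (neighbors j ± 1 and j ± √n, modulo n) and that n is a power of 4, so that √n is a power of two. The function names are hypothetical.

```python
import math

def pm2i_neighbors(j, n):
    """PM2I: node j is linked to (j +/- 2**i) mod n for 0 <= i < log2(n)."""
    m = n.bit_length() - 1  # log2(n) when n is a power of two
    return {(j + s * 2**i) % n for i in range(m) for s in (1, -1)}

def grid_neighbors(j, n):
    """Wrap-around 2D grid on a linear index: j +/- 1 and j +/- sqrt(n)."""
    r = math.isqrt(n)
    return {(j + d) % n for d in (1, -1, r, -r)}

n = 64  # 8 x 8 grid; sqrt(n) = 8 is a power of two
print(all(grid_neighbors(j, n) <= pm2i_neighbors(j, n) for j in range(n)))  # True
```

Because each grid neighbor is reachable over a single PM2I link, one grid communication step costs one PM2I step. The reverse direction is more expensive, since most PM2I links have no direct counterpart in the grid.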









|                    | PRAM                                                                                                                                                                                |
|--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| PRAM<br>extensions | Exclusive-read, exclusive-write (EREW) PRAM<br>- exclusive access to each memory word<br>- weakest model: memory management needs to support only a minimum of concurrency |
|                    | Concurrent-read, exclusive-write (CREW) PRAM<br>- multiple simultaneous read accesses to a memory word<br>- write accesses are serialized |
|                    | Exclusive-read, concurrent-write (ERCW) PRAM<br>- parallel write accesses to the same memory word<br>- read accesses are serialized |
|                    | Concurrent-read, concurrent-write (CRCW) PRAM<br>- most powerful model<br>- can be simulated on EREW |
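
The four variants differ only in which access patterns they permit within a single step. The following Python sketch checks one step against a given model. It is an illustration only: the function name `step_allowed` is hypothetical, and the sketch assumes that a step consists of a read phase followed by a write phase, so a read and a write to the same word do not conflict.

```python
from collections import Counter

def step_allowed(model, reads, writes):
    """Check whether one PRAM step is legal under the given model.

    model:  "EREW", "CREW", "ERCW", or "CRCW"
    reads:  addresses read in this step, one per reading processor
    writes: addresses written in this step, one per writing processor
    """
    concurrent_read = any(c > 1 for c in Counter(reads).values())
    concurrent_write = any(c > 1 for c in Counter(writes).values())
    if concurrent_read and model[0] == "E":   # first letter: read mode
        return False
    if concurrent_write and model[2] == "E":  # third letter: write mode
        return False
    return True

# Four processors all read word 0 (a broadcast) and write distinct words.
reads, writes = [0, 0, 0, 0], [1, 2, 3, 4]
for model in ("EREW", "CREW", "ERCW", "CRCW"):
    print(model, step_allowed(model, reads, writes))
# EREW False
# CREW True
# ERCW False
# CRCW True
```

On the exclusive-read models the broadcast has to be spread over several steps, for example by doubling the number of copies in each step, which is one reason why simulating a stronger model on EREW costs additional time.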



























