

IT Systems Engineering | Universität Potsdam

### Parallel Programming Concepts

**GPU Compute Devices** 

Frank Feinbube

Operating Systems and Middleware Prof. Dr. Andreas Polze

#### HPI Hasso Plattner Institut

### The Power of GPU Compute Devices





```
Using CUDA device [0]: GeForce GTX 275

Sorting 1048576 32-bit unsigned int keys and values

radixSort, Throughput = 74.6231 MElements/s, Time = 0.01405 s, Size = 1048576 el

ements, NumDevsUsed = 1, Workgroup = 256

PASSED

RadixSort
```



http://www.nvidia.com/object/cuda

apps

flash

new.html

### Wide Varity of Application Domains



ParProg | GPU Computing | FF2011

# Why GPU Compute Devices? Short Term View: Cheap Performance



′

#### Performance



#### Energy / Price

- Cheap to buy and to maintain
- GFLOPS per watt: Fermi 1,5 / Keppler 5 / Maxwell 15

## Why GPU Compute Devices? Middle Term View: More Performance



8



# Why GPU Compute Devices? Long Term View: Hybrid Computing



Dealing with massivly multi-core:

- New architectures are evaluated (Intel SCC)
- Accelerators that accompany common general purpose CPUs (Hybrid Systems)

#### Hybrid Systems

GPU Compute Devices:

High Performance Computing (3 of top 5 supercomputers are GPU-based!),
Business Servers, Home/Desktop Computers,
Mobile and Embedded Systems

Special-Purpose Accelerators:
 (de)compression, XML parsing, (en|de)cryption,
 regular expression matching





### History of GPU Computing



## Fixed Function Graphic Pipelines

• 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); Vertex Processing

#### Programmable Real-Time Graphics

 Since 2001: APIs for Vertex Shading, Pixel Shading and access to texture; DirectX9

## Unified Graphics and Computing Processors

 2006: NVIDIAs G80; unified processors arrays; three programmable shading stages; DirectX10

## General Purpose GPU (GPGPU)

 compute problem as native graphic operations; algorithms as shaders; data in textures

### **GPU Computing**

 Programming CUDA; shaders programmable; load and store instructions; barriers; atomics

### CPU vs. GPU Architecture





ParProg | GPU Computing | FF2011



### Open Compute Language (OpenCL)



### OpenCL Platform Model





- OpenCL exposes CPUs, GPUs, and other Accelerators as "devices"
- Each "device" contains one or more "compute units", i.e. cores, SMs,...
- Each "compute unit" contains one or more SIMD "processing elements"

### OpenCL Execution Model



- Parallel work is submitted to devices by launching kernels
- Kernels run over global dimension index ranges (NDRange), broken up into "work groups", and "work items"
- Work items executing within the same work group can synchronize with each other with barriers or memory fences

■ Work items in different work groups can't sync with each other, except by launching a new kernel work-group size S<sub>x</sub>



### OpenCL Execution Model





An example of an NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs. [4]





An OpenCL kernel is executed by an array of work items.

- All work items run the same code (SPMD)
- Each work item has an index that it uses to compute memory addresses and make control decisions





### Work Groups: Scalable Cooperation

Divide monolithic work item array into work groups

- Work items within a work group cooperate via shared memory, atomic operations and barrier synchronization
- Work items in different work groups cannot cooperate



### OpenCL Memory Architecture





**Private** 

Per work-item

Local
Shared within
a workgroup

Global/ Constant

Visible to all workgroups

**Host Memory** 

On the CPU

### OpenCL Memory Architecture



■ Memory management is explicit: you must move data from host → global → local... and back

| Memory<br>Type     | Keyword  | Description/Characteristics                                                                            |
|--------------------|----------|--------------------------------------------------------------------------------------------------------|
| Global<br>Memory   | global   | Shared by all work items; read/write; may be cached (modern GPU), else slow; huge                      |
| Private<br>Memory  | private  | For local variables; per work item; may be mapped onto global memory (Arrays on GPU)                   |
| Local<br>Memory    | local    | Shared between workitems of a work group; may be mapped onto global memory (not GPU), else fast; small |
| Constant<br>Memory | constant | Read-only, cached; add. special kind for GPUs: texture memory                                          |

### **GPU Computing Platforms**



AMD

R700, R800, R900





NVIDIA INVIDIA.

G80, G92, GT200, GF100, GF110

Geforce, Quadro, Tesla, ION





### GPU Hardware in Detail













Simpler architecture, but same principles

Several Work Groups reside on one SM

- Amount depends on available resources (Shared Memory (=Local Memory in OpenCL), Registers)
- More Work Groups → better latency hiding
  - Latencies occur for memory accesses,
     pipelined floating-point arithmetic and branch instructions

Thread execution in "Warps" (called "wavefronts" on AMD)

- Native execution size (32 Threads for NVIDIA)
- Zero-Overhead Thread Scheduling: If one warp stalls (accesses memory) next warp is selected for execution



ParProg | GPU Computing | FF2011

### Warp Execution Example



Application creates 200.000 "Tasks"

→ Global Work Group Size:

200.000 Work Items

Programmer decides to use a Local Work Group Size of 100 Work Items

→ Number of Work Groups:

2.000 Work Groups

One Work Item requires 10 registers and 20 byte of Shared Memory; a SM has 16 KB of Shared Memory and 16.384 registers

- → Number of Work Items per SM: 16KB/20B = 819 Work Items
- → Number of Work Groups per SM:819/100 = 8 Work Groups per SM

Even if 7 Work Groups are waiting for memory, 1 can be executed.

### Warp Execution Example



Each of the Work Groups contains 100 Work Items; the Warp Size (native execution size of a SM) is 32

- → Number of Threads Executed in parallel: 32 Threads
- → Number of "Rounds" to execute a Work Group: 100/32 = 4
- → Threads running in the first 3 rounds: 32 Threads
- $\rightarrow$  Threads running in the last round: 100-32\*4=4 Threads

If one of the threads accesses memory: whole warp stalls If one of the threads follows a differing execution path: it is executed in an additional seperate round



### Compute Capability by version

|                                            | 1.0  | 1.1 | 1.2 | 1.3  | 2.0   |
|--------------------------------------------|------|-----|-----|------|-------|
| double precision floating point operations | No   |     | Yes |      |       |
| caches                                     | No   |     |     | Yes  |       |
| max # concurrent kernels                   | 1    |     |     | 8    |       |
| max # threads per block                    | 512  |     |     | 1024 |       |
| max # Warps per MP                         | 2    | 4   | 3   | 2    | 48    |
| max # Threads per MP                       | 76   | 58  | 10  | 24   | 1536  |
| register count (32 bit)                    | 81   | 92  | 163 | 384  | 32768 |
| max shared mem per MP                      | 16KB |     |     | 48KB |       |
| # shared memory banks                      | 16   |     |     | 32   |       |

Plus: varying amounts of cores, global memory sizes, bandwidth, clock speeds (core, memory), bus width, memory access penalties ...





big performance gains for small problem sizes



\* less is better





small/moderate performance gains for large problem sizes

→ further optimizations needed



\* less is better



### Best Practices for Performance Tuning

ParProg | GPU Computing | FF2011

Algorithm Design Asynchronous, Recompute, Simple Memory Transfer • Chaining, Overlap Transfer & Compute Control Flow Divergent Branching, Predication Memory Types • Local Memory as Cache, rare resource Coalescing, Bank Conflicts Memory Access Sizing Execution Size, Evaluation **Instructions**  Shifting, Fused Multiply, Vector Types Precision • Native Math Functions, Build Options



### Divergent Branching and Predication

#### **Divergent Branching**

- Flow control instruction (if, switch, do, for, while) can result in different execution paths
- ▶ Data parallel execution → varying execution paths will be serialized
- Threads converge back to same execution path after completion

#### **Branch Predication**

- Instructions are associated with a per-thread condition code (predicate)
  - All instructions are scheduled for execution
  - Predicate true: executed normally
  - Predicate false: do not write results, do not evaluate addresses, do not read operands
- Compiler may use branch predication for if or switch statements
- Unroll loops yourself (or use #pragma unroll for NVIDIA)

### Coalesced Memory Accesses



#### **Simple Access Pattern**

- Can be fetched in a single 64-byte transaction (red rectangle)
- Could also be permuted \*



#### **Sequential but Misaligned Access**

■ Fall into single 128-byte segment: single 128-byte transaction, else: 64-byte transaction + 32-byte transaction \*



#### **Strided Accesses**

Depending on stride from 1 (here) up to 16 transactions \*

<sup>\* 16</sup> transactions with compute capability 1.1





### Use Caching: Local, Texture, Constant

#### **Local Memory**

- Memory latency roughly 100x lower than global memory latency
- Small, no coalescing problems, prone to memory bank conflicts

#### **Texture Memory**

- 2-dimensionally cached, read-only
- Can be used to avoid uncoalesced loads form global memory
- Used with the image data type

| 0   | 1   | 2           | 3   |  |
|-----|-----|-------------|-----|--|
| 64  | 65  | <b>→</b> 66 | 67  |  |
| 128 | 129 | 130         | 131 |  |
| 192 | 193 | 194         | 195 |  |

#### **Constant Memory**

- Lineary cached, read-only, 64 KB
- as fast as reading from a register for the same address
- Can be used for big lists of input arguments

- Access to (Shared) Memory is implemented via hardware memory banks
- If a thread accesses a memory address this is handled by the responsible memory bank

 Simple Access Patterns like this one are fetched in a single transaction



ParProg | GPU Computing | FF2011

[8]

### Permuted Memory Access (left)

Still one transaction on cards with compute capability >=1.2; otherwise 16 transactions are required

#### Strided Memory Access (right)

Still one transaction on cards with compute capability >=1.2; otherwise 16 transactions are required



Address 128

Address 140

Address 152

Address 164

Address 176

Address 188

ParProg | GPU Computing | FF2011

#### Thread 0 Bank 0 Thread 0 Bank 0 Bank 1 Thread 1 Thread 1 Bank 1 Memory Bank Conflicts Bank 2 Thread 2 Bank 2 Thread 2 Bank 3 Bank 3 Thread 3 Thread 3 Bank conflicts Bank 4 Thread 4 Bank 4 Thread 4 Bank 5 Bank 5 Thread 5 Thread 5 Left figure: 2 bank Bank 6 conflicts → resulting Thread 6 Thread 6 Bank 6 bandwidth is ½ of the Bank 7 Thread 7 Thread 7 Bank 7 original bandwidth Bank 8 Thread 8 Bank 8 Thread 8 Right figure: 8 bank Bank 9 Thread 9 Thread 9 Bank 9 conflicts → resulting Thread 10 Bank 10 Thread 10 Bank 10 bandwidth is 1/8 of the original bandwidth Bank 11 Thread 11 Thread 11 Bank 11 Thread 12 Bank 12 Thread 12 Bank 12 Thread 13 Bank 13 Thread 13 Bank 13 Bank 14 Thread 14 Thread 14 Bank 14 Thread 15 Bank 15 Thread 15 Bank 15 [8] ParProg | GPU Computing | FF2011

52

# Sizing: What is the right execution layout?



 Local work item count should be a multiple of native execution size (NVIDIA 32, AMD 64), but not to big

 Number of work groups should be multiple of the number of multiprocessors (hundreds or thousands of work groups)

Can be configured in 1-, 2- or 3-dimensional layout: consider

access patterns and caching

 Balance between latency hiding and resource utilization

Experimenting is required!



#### Instructions and Precision



- Single precision floats provide best performance
- Use shift operations to avoid expensive division and modulo calculations
- Special compiler flags
- AMD has native vector type implementation; NVIDIA is scalar
- Use the native math library whenever speed trumps precision

| Functions                                                               | Throughput                   |  |  |
|-------------------------------------------------------------------------|------------------------------|--|--|
| single-precision floating-point add, multiply, and multiply-add         | 8 operations per clock cycle |  |  |
| single-precision reciprocal, reciprocal square root, and native_logf(x) | 2 operations per clock cycle |  |  |
| native_sin, native_cos, native_exp                                      | 1 operation per clock cycle  |  |  |

### Further Readings



#### http://www.dcl.hpi.uni-potsdam.de/research/gpureadings/

- [1] Kirk, D. B. & Hwu, W. W., 2010. *Programming Massively Parallel Processors: A Hands-on Approach*. 1 ed. Morgan Kaufmann.
- [2] Herlihy, M. & Shavit, N., 2008. *The Art of Multiprocessor Programming*.
- [3] Sanders, J. & Kandrot, E., 2010. *CUDA by Example: An Introduction to General-Purpose GPU Programming*. 1 ed. Addison-Wesley Professional.
- [4] Munshi, A. (ed.), 2010. The OpenCL Specification v1.1. The Khronos Group Inc.
- [5] Mattson, T., 2010. The Future of Many Core Computing: Software for many core processors.
- [6] NVIDIA, 2009. NVIDIA OpenCL Best Practices Guide Version 2.3.
- [7] Rob Farber, 2008. CUDA, Supercomputing for the Masses. Dr. Dobb's
- [8] NVIDIA, 2010. OpenCL Programming for the CUDA Architecture Version 3.1
- [9] Ryan Smith, NVIDIA's GeForce GTX 480 and GTX 470: 6 Months Late, Was It Worth the Wait?