

IT Systems Engineering | Universität Potsdam

Parallel Programming Concepts

**GPU** Compute Devices

Frank Feinbube

Operating Systems and Middleware Prof. Dr. Andreas Polze



## The Power of GPU Compute Devices







# Wide Varity of Application Domains

6



## Why GPU Compute Devices? Short Term View: Cheap Performance



### Performance

7



### Energy / Price

- Cheap to buy and to maintain
- GFLOPS per watt: Fermi 1,5 / Keppler 5 / Maxwell 15

## Why GPU Compute Devices? Middle Term View: More Performance





ParProg | GPU Computing | FF2013

8

# Why GPU Compute Devices? Long Term View: Hybrid Computing

Dealing with massivly multi-core:

- New architectures are evaluated (Intel SCC)
- Accelerators that accompany common general purpose CPUs (Hybrid Systems)

### Hybrid Systems

GPU Compute Devices:

High Performance Computing (3 of top 5 supercomputers are GPU-based!), Business Servers, Home/Desktop Computers, Mobile and Embedded Systems

### Special-Purpose Accelerators: (de)compression, XML parsing, (en|de)cryption, regular expression matching





## History of GPU Computing



| Fixed Function<br>Graphic Pipelines          | <ul> <li>1980s-1990s; configurable, not programmable;<br/>first APIs (DirectX, OpenGL); Vertex Processing</li> </ul> |
|----------------------------------------------|----------------------------------------------------------------------------------------------------------------------|
|                                              |                                                                                                                      |
| Programmable Real-<br>Time Graphics          | <ul> <li>Since 2001: APIs for Vertex Shading, Pixel<br/>Shading and access to texture; DirectX9</li> </ul>           |
|                                              |                                                                                                                      |
| Unified Graphics and<br>Computing Processors | • 2006: NVIDIAs G80; unified processors arrays; three programmable shading stages; DirectX10                         |
|                                              |                                                                                                                      |
| General Purpose GPU<br>(GPGPU)               | <ul> <li>compute problem as native graphic operations;<br/>algorithms as shaders; data in textures</li> </ul>        |
|                                              |                                                                                                                      |
| GPU Computing                                | <ul> <li>Programming CUDA; shaders programmable;<br/>load and store instructions; barriers; atomics</li> </ul>       |
|                                              |                                                                                                                      |



## CPU vs. GPU Architecture

11





# Open Compute Language (OpenCL)







## **OpenCL Platform Model**

15



- OpenCL exposes CPUs, GPUs, and other Accelerators as "devices"
- Each "device" contains one or more "compute units", i.e. cores, SMs,...
- Each "compute unit" contains one or more SIMD "processing elements"

## **OpenCL Execution Model**



- Parallel work is submitted to devices by launching kernels
- Kernels run over global dimension index ranges (NDRange), broken up into "work groups", and "work items"
- Work items executing within the same work group can synchronize with each other with barriers or memory fences
- Work items in different work groups can't sync with each other, except by launching a new kernel



## **OpenCL Execution Model**





An example of an NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs. [4]



19

An OpenCL kernel is executed by an array of work items.

- All work items run the same code (SPMD)
- Each work item has an index that it uses to compute memory addresses and make control decisions





[1]

Divide monolithic work item array into work groups

- Work items within a work group cooperate via shared memory, atomic operations and barrier synchronization
- Work items in different work groups cannot cooperate



## **OpenCL Memory Architecture**







22

Memory management is explicit: you must move data from host
  $\rightarrow$  global  $\rightarrow$  local... and back

| Memory<br>Type     | Keyword  | <b>Description/Characteristics</b>                                                                           |
|--------------------|----------|--------------------------------------------------------------------------------------------------------------|
| Global<br>Memory   | global   | Shared by all work items; read/write; may be cached (modern GPU), else slow; huge                            |
| Private<br>Memory  | private  | For local variables; per work item; may be mapped onto global memory (Arrays on GPU)                         |
| Local<br>Memory    | local    | Shared between workitems of a work group;<br>may be mapped onto global memory (not<br>GPU), else fast; small |
| Constant<br>Memory | constant | Read-only, cached; add. special kind for GPUs: texture memory                                                |



### Live Demo

# **OpenCL** "Hello Device"

# **OpenCL "Sudoku Validator"**



Software development kits: NVIDIA and AMD; Windows and Linux

**Special libraries:** AMD Core Math Library, BLAS and FFT libraries by NVIDIA, OpenNL for numerics and CULA for linear algebra; NVIDIA Performance Primitives library: a sink for common GPU accelerated algorithms

### Profiling and debugging tools:

- NVIDIAs Parallel Nsight for Microsoft Visual Studio
- AMDs ATI Stream Profiler
- AMDs Stream KernelAnalyzer: displays GPU assembler code, detects execution bottlenecks
- gDEBugger (platform-independent)

Big knowledge bases with tutorials, examples, articles, show cases, and developer forums

## **GPU** Computing Platforms



**AMD** R700, R800, R900







G80, G92, GT200, GF100, GF110

Geforce, Quadro, Tesla, ION NVIDIA Tesla



# GPU Hardware in Detail



**Host Interface** 





Host Interface

**GigaThread Engine** 













## GT200 – previous architecture

40

Simpler architecture, but same principles

Several Work Groups reside on one SM

- Amount depends on available resources (Shared Memory (=Local Memory in OpenCL), Registers)
- More Work Groups → better latency hiding
  - Latencies occur for memory accesses, pipelined floating-point arithmetic and branch instructions

Thread execution in "Warps" (called "wavefronts" on AMD)

- Native execution size (32 Threads for NVIDIA)
- Zero-Overhead Thread Scheduling: If one warp stalls (accesses memory) next warp is selected for execution

Streaming Multiprocessor (SM) Instruction Cache Warp Scheduler and Registers **Constant Cache** SP SP SFU SFU Shared Memory



Application creates 200.000 "Tasks"

 $\rightarrow$  Global Work Group Size: 200.000 Work Items

Programmer decides to use a Local Work Group Size of 100 Work Items

 $\rightarrow$  Number of Work Groups: 2.000 Work Groups

One Work Item requires 10 registers and 20 byte of Shared Memory; a SM has 16 KB of Shared Memory and 16.384 registers

 $\rightarrow$  Number of Work Items per SM: 16KB/20B = 819 Work Items

 $\rightarrow$  Number of Work Groups per SM:819/100 = 8 Work Groups per SM

Even if 7 Work Groups are waiting for memory, 1 can be executed.



42

Each of the Work Groups contains 100 Work Items; the Warp Size (native execution size of a SM) is 32

- $\rightarrow$  Number of Threads Executed in parallel: 32 Threads
- $\rightarrow$  Number of "Rounds" to execute a Work Group: 100/32 = 4
- $\rightarrow$  Threads running in the first 3 rounds: 32 Threads
- $\rightarrow$  Threads running in the last round: 100-32\*4=4 Threads

If one of the threads accesses memory: whole warp stalls

If one of the threads follows a differing execution path: it is executed in an additional seperate round

## Compute Capability by version



|                                            | 1.0  | 1.1 | 1.2  | 1.3  | 2.0   |
|--------------------------------------------|------|-----|------|------|-------|
| double precision floating point operations | No   |     | Yes  |      |       |
| caches                                     | No   |     |      | Yes  |       |
| max # concurrent kernels                   | 1    |     |      | 8    |       |
| max # threads per block                    | 512  |     |      | 1024 |       |
| max # Warps per MP                         | 2    | 4   | 3    | 2    | 48    |
| max # Threads per MP                       | 76   | 58  | 10   | 24   | 1536  |
| register count (32 bit)                    | 81   | 92  | 163  | 384  | 32768 |
| max shared mem per MP                      | 16KB |     | 48KB |      |       |
| # shared memory banks                      | 16   |     |      | 32   |       |

Plus: varying amounts of cores, global memory sizes, bandwidth, clock speeds (core, memory), bus width, memory access penalties ...

## The Power of GPU Computing



big performance gains for small problem sizes



<sup>\*</sup> less is better



small/moderate performance gains for large problem sizes  $\rightarrow$  further optimizations needed



<sup>\*</sup> less is better



| Algorithm Design | <ul> <li>Asynchronous, Recompute, Simple</li> </ul>        |
|------------------|------------------------------------------------------------|
| Memory Transfer  | • Chaining, Overlap Transfer & Compute                     |
| Control Flow     | <ul> <li>Divergent Branching, Predication</li> </ul>       |
| Memory Types     | <ul> <li>Local Memory as Cache, rare resource</li> </ul>   |
| Memory Access    | <ul> <li>Coalescing, Bank Conflicts</li> </ul>             |
| Sizing           | <ul> <li>Execution Size, Evaluation</li> </ul>             |
| Instructions     | <ul> <li>Shifting, Fused Multiply, Vector Types</li> </ul> |
| Precision        | Native Math Functions, Build Options                       |



### **Divergent Branching**

- Flow control instruction (if, switch, do, for, while) can result in different execution paths
- $\blacktriangleright$  Data parallel execution  $\rightarrow$  varying execution paths will be serialized
- > Threads converge back to same execution path after completion

### **Branch Predication**

- Instructions are associated with a per-thread condition code (predicate)
  - All instructions are scheduled for execution
  - Predicate true: executed normally
  - Predicate false: do not write results, do not evaluate addresses, do not read operands
- Compiler may use branch predication for if or switch statements
- Unroll loops yourself (or use #pragma unroll for NVIDIA)

## Coalesced Memory Accesses



### Simple Access Pattern

- Can be fetched in a single 64-byte transaction (red rectangle)
- Could also be permuted \*

#### **Sequential but Misaligned Access**

 Fall into single 128-byte segment: single 128-byte transaction, else: 64-byte transaction + 32byte transaction \*

### **Strided Accesses**

- Depending on stride from 1 (here) up to 16 transactions \*
- \* 16 transactions with compute capability 1.1













### Local Memory

- Memory latency roughly 100x lower than global memory latency
- Small, no coalescing problems, prone to memory bank conflicts

### **Texture Memory**

- 2-dimensionally cached, read-only
- Can be used to avoid uncoalesced loads form global memory
- Used with the image data type



### **Constant Memory**

- Lineary cached, read-only, 64 KB
- as fast as reading from a register for the same address
- Can be used for big lists of input arguments

## Memory Bank Conflicts

50

- Access to (Shared) Memory is implemented via hardware memory banks
- If a thread accesses a memory address this is handled by the responsible memory bank

 Simple Access Patterns like this one are fetched in a single transaction



## Memory Bank Conflicts

Permuted Memory Access (left)

 Still one transaction on cards with compute capability >=1.2; otherwise 16 transactions are required

Strided Memory Access (right)

 Still one transaction on cards with compute capability >=1.2; otherwise 16 transactions are required



## Memory Bank Conflicts

52

### Bank conflicts

- Left figure: 2 bank conflicts → resulting bandwidth is ½ of the original bandwidth
- Right figure: 8 bank conflicts → resulting bandwidth is 1/8 of the original bandwidth





- Local work item count should be a multiple of native execution size (NVIDIA 32, AMD 64), but not to big
- Number of work groups should be multiple of the number of multiprocessors (hundreds or thousands of work groups)
- Can be configured in 1-, 2- or 3-dimensional layout: consider access patterns and caching
- Balance between latency hiding and resource utilization
- Experimenting is required!



## Instructions and Precision



- Single precision floats provide best performance
- Use shift operations to avoid expensive division and modulo calculations
- Special compiler flags
- AMD has native vector type implementation; NVIDIA is scalar
- Use the native math library whenever speed trumps precision

| Functions                                                                     | Throughput                   |
|-------------------------------------------------------------------------------|------------------------------|
| single-precision floating-point add, multiply, and multiply-add               | 8 operations per clock cycle |
| single-precision reciprocal,<br>reciprocal square root, and<br>native_logf(x) | 2 operations per clock cycle |
| native_sin, native_cos, native_exp                                            | 1 operation per clock cycle  |
| Prog   CPU Computing   FE2013                                                 |                              |



#### 55

### http://www.dcl.hpi.uni-potsdam.de/research/gpureadings/

- [1] Kirk, D. B. & Hwu, W. W., 2010. Programming Massively Parallel Processors: A Hands-on Approach. 1 ed. Morgan Kaufmann.
- [2] Herlihy, M. & Shavit, N., 2008. *The Art of Multiprocessor Programming*.
- [3] Sanders, J. & Kandrot, E., 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming . 1 ed. Addison-Wesley Professional.
- [4] Munshi, A. (ed.), 2010. The OpenCL Specification v1.1. The Khronos Group Inc.
- [5] Mattson, T., 2010. The Future of Many Core Computing: Software for many core processors.
- [6] NVIDIA, 2009. NVIDIA OpenCL Best Practices Guide Version 2.3.
- [7] Rob Farber, 2008. CUDA, Supercomputing for the Masses. Dr. Dobb's
- [8] NVIDIA, 2010. OpenCL Programming for the CUDA Architecture Version 3.1
- [9] Ryan Smith, NVIDIA's GeForce GTX 480 and GTX 470: 6 Months Late, Was It Worth the Wait?