

Digital Engineering • Universität Potsdam

# Parallel Programming and Heterogeneous Computing

SIMD: Integrated Accelerators

Max Plauth, <u>Sven Köhler</u>, Felix Eberhardt, Lukas Wenzel, and Andreas Polze

Operating Systems and Middleware Group



#### SIMD & AltiVec

ParProg20 C1 Integrated Accelerators

Sven Köhler

# Definition SIMD

SIMD ::= **S**ingle Instruction **M**ultiple **D**ata

The same instruction is performed simultaneously on multiple data points (fit for data-level parallelism).

First proposed for ILLIAC IV, University of Illinois (1966).

Today many architectures provide SIMD instruction set extensions.

- Intel: MMX, SSE, AVX
- ARM: VFP, NEON, SVE
- POWER: AltiVec (VMX), VSX






# Scalar vs. SIMD



How many instructions are needed to add four numbers from memory?

$\begin{array}{c} A_0 \\ A_1 \\ A_2 \\ A_3 \end{array} + \begin{array}{c} B_0 \\ B_1 \\ B_2 \\ B_3 \end{array} = \begin{array}{c} C_0 \\ C_1 \\ C_2 \\ C_3 \end{array}$

- scalar: 4 additions, 8 loads, 4 stores
- 4 element SIMD: 1 addition, 2 loads, 1 store
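The two instruction counts can be sketched in portable C. This is only an illustration: it assumes GCC or Clang, whose generic vector extension (`__attribute__((vector_size(16)))`) stands in for a real SIMD type, and the names `v4si`, `add_scalar` and `add_simd` are made up here. Whether the compiler really emits a single vector addition depends on target and optimization flags.

```c
#include <assert.h>
#include <string.h>

/* Four 32-bit integers in one 16-byte vector (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Scalar version: every element needs 2 loads, 1 addition and 1 store. */
void add_scalar(const int *a, const int *b, int *c)
{
    assert(a && b && c);
    for (int i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}

/* SIMD version: 2 vector loads, 1 vector addition, 1 vector store. */
void add_simd(const int *a, const int *b, int *c)
{
    assert(a && b && c);
    v4si va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load A0..A3 */
    memcpy(&vb, b, sizeof vb);   /* load B0..B3 */
    vc = va + vb;                /* one element-wise addition */
    memcpy(c, &vc, sizeof vc);   /* store C0..C3 */
}
```

Both functions produce the same four sums; only the number of operations differs.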

# Vector Registers on POWER8 (1)

32 vector registers containing 128 bits each.



AltiVec/VMX



These are also used by several **coprocessors**: VSX, SHA2, AES, ...



# Vector Registers on POWER8 (2)

32 vector registers containing 128 bits each.

Depending on the instruction they are interpreted as

| Elements    | Interpretation                    |
|-------------|-----------------------------------|
| 16          | (un)signed bytes                  |
| 8           | (un)signed shorts                 |
| 4           | (un)signed integers of 32bit      |
| 4           | single precision floats           |
| 2           | (un)signed long integers of 64bit |
| 2           | double precision floats           |
| 2, 4, 8, 16 | logic values                      |
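The idea that the same 128 bits are just viewed differently can be modelled with a plain C union. This is a hedged sketch and not an AltiVec type: `reg128` and `lanes` are names invented for the illustration, and the order in which bytes map onto wider elements depends on the endianness of the machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 128-bit register image, viewed under different element types. */
typedef union {
    uint8_t  bytes[16];
    uint16_t shorts[8];
    uint32_t ints[4];
    float    floats[4];
    uint64_t longs[2];
    double   doubles[2];
} reg128;

/* Number of elements a 128-bit register holds for a given element size. */
unsigned lanes(size_t element_size)
{
    assert(element_size > 0 && 16 % element_size == 0);
    return (unsigned)(16 / element_size);
}
```

Writing through one member and reading through another reinterprets the bits without converting them, which is what the instruction choice does for a vector register.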



# **AltiVec Instruction Reference**

For all instructions, registers and usage see PowerISA 2.07(B), chapter 6 & 7.

#### 6.7.2 Vector Load Instructions

The aligned byte, halfword, word, or quadword in storage addressed by EA is loaded into register VRT.

#### Programming Note

The Load Vector Element instructions load the specified element into the same location in the target register as the location into which it would be loaded using the Load Vector instruction.

#### Load Vector Element Byte Indexed X-form

Let eb be bits 60:63 of EA.

If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA are placed into byte eb of register VRT. The remaining bytes in register VRT are set to undefined values.

#### Load Vector Element Halfword Indexed X-form

`VRT,RA,RB`

```
if RA = 0 then b ← 0
else           b ← (RA)
EA ← (b + (RB)) & 0xFFFF_FFFF_FFFF_FFFE
eb ← EA[60:63]
VRT ← undefined
if Big-Endian byte ordering then
   VRT[8×eb : 8×eb+15] ← MEM(EA,2)
else
   VRT[112-(8×eb) : 127-(8×eb)] ← MEM(EA,2)
```

Let the effective address (EA) be the result of ANDing the sum (RA|0)+(RB) with 0xFFFF_FFFF_FFFF_FFFE. Let eb be bits 60:63 of EA.





# C-Interface

`#include <altivec.h>`

- `gcc -maltivec -mabi=altivec`
- `gcc -mvsx`
- `xlc -qaltivec -qarch=auto`

# Vector Data Types

The C-Interface introduces new keywords and data types:

| Types | Elements |
|-------|----------|
| `vector unsigned char`, `vector signed char`, `vector bool char` | 16 x 1 byte |
| `vector unsigned short`, `vector signed short`, `vector bool short`, `vector pixel` | 8 x 2 bytes |
| `vector unsigned int`, `vector signed int`, `vector bool int`, `vector float` | 4 x 4 bytes |
| `vector unsigned long`, `vector signed long`, `vector double` | 2 x 8 bytes |



```
vector int va = {1, 2, 3, 4};

int data[] = {1, 2, 3, 4, 5, 6, 7, 8};
vector int vb = *((vector int *)data);

int output[4];
*((vector int *)output) = va;
```



Historically, memory addresses were required to be **aligned at 16 byte** boundaries for efficiency reasons. (Although POWER8 has improved unaligned load/store and modern compilers will support you.)

```
/* effective address = index + address, truncated to a multiple of 16 */
vector int va = vec_ld(0, data);
vec_st(va, 0, output);
```
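The address truncation can be modelled in plain C. The sketch below is not part of the AltiVec interface: `effective_address` and `is_aligned16` are hypothetical helpers that only mirror the "index + address, low four bits cleared" rule described above.

```c
#include <assert.h>
#include <stdint.h>

/* Address used by a classic AltiVec load/store:
 * (base + offset) with the low four bits cleared. */
uintptr_t effective_address(uintptr_t base, long offset)
{
    return (base + (uintptr_t)offset) & ~(uintptr_t)15;
}

/* A pointer is 16-byte aligned when truncation leaves it unchanged. */
int is_aligned16(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    assert(effective_address(addr, 0) <= addr);
    return effective_address(addr, 0) == addr;
}
```

A consequence worth keeping in mind: with this rule a misaligned pointer does not fault, it silently accesses the enclosing aligned 16-byte block.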

<sup>1</sup> https://gcc.gnu.org/onlinedocs/gcc-8.4.0/gcc/PowerPC-AltiVec\_002fVSX-Built-in-Functions.html

## Vector Intrinsics

Operations are available through a rich set<sup>1</sup> of "overloaded functions" (actually intrinsics):

```
vector int va = {4, 3, 2, 1};
vector int vb = {1, 2, 3, 4};
vector int vc = vec_add(va, vb);

vector float vfa = {4, 3, 2, 1};
vector float vfb = {1, 2, 3, 4};
vector float vfc = vec_add(vfa, vfb);
```

## Vector Intrinsics: Lots of overloads

- `vector signed char vec_add (vector bool char, vector signed char);`
- `vector signed char vec_add (vector signed char, vector bool char);`
- `vector signed char vec_add (vector signed char, vector signed char);`
- `vector unsigned char vec_add (vector bool char, vector unsigned char);`
- `vector unsigned char vec_add (vector unsigned char, vector bool char);`
- `vector unsigned char vec_add (vector unsigned char, vector unsigned char);`
- `vector signed short vec_add (vector bool short, vector signed short);`
- `vector signed short vec_add (vector signed short, vector bool short);`
- ...
- `vector unsigned int vec_add (vector unsigned int, vector bool int);`
- `vector unsigned int vec_add (vector unsigned int, vector unsigned int);`
- `vector float vec_add (vector float, vector float);`
- `vector double vec_add (vector double, vector double);`
- `vector long long vec_add (vector long long, vector long long);`
- `vector unsigned long long vec_add (vector unsigned long long, vector unsigned long long);`

**Attention: No implicit conversion!** Also, not every operation is available for all types.

# Get Help: Programming Interface Manual

Highly helpful resource:

- Name of operation
- Pseudocode description
- Text description
- Graphical description
- Type table and according assembly instruction

Excerpt from "Generic and Specific AltiVec Operations":

**vec_add**: Vector Add

`d = vec_add(a, b)`

Integer add:

```
n ← number of elements
do i=0 to n-1
    d_i ← a_i + b_i
end
```

Floating-point add:

```
do i=0 to 3
    d_i ← a_i +fp b_i
end
```

Each element of a is added to the corresponding element of b. Each sum is placed in the corresponding element of d.

For vector float argument types, if VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element is truncated to a 0 of the same sign.

The valid combinations of argument types and the corresponding result types for d = vec_add(a,b) are shown in Figure 4-12, Figure 4-13, Figure 4-14, and Figure 4-15.

| d                    | a                    | b                    | maps to       |
|----------------------|----------------------|----------------------|---------------|
| vector unsigned char | vector unsigned char | vector unsigned char | vaddubm d,a,b |
| vector unsigned char | vector unsigned char | vector bool char     | vaddubm d,a,b |
| vector unsigned char | vector bool char     | vector unsigned char | vaddubm d,a,b |
| vector signed char   | vector signed char   | vector signed char   | vaddubm d,a,b |

http://www.nxp.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf



# Get Help: IBM Knowledge Center

IBM has online documentation of the extended standard, which is not fully implemented by GCC.



| d                         | a                         | b                         |
|---------------------------|---------------------------|---------------------------|
| vector signed char        | vector signed char        | vector signed char        |
| vector unsigned char      | vector unsigned char      | vector unsigned char      |
| vector signed short       | vector signed short       | vector signed short       |
| vector unsigned short     | vector unsigned short     | vector unsigned short     |
| vector signed int         | vector signed int         | vector signed int         |
| vector unsigned int       | vector unsigned int       | vector unsigned int       |
| vector signed long long   | vector signed long long   | vector signed long long   |
| vector unsigned long long | vector unsigned long long | vector unsigned long long |
| vector float              | vector float              | vector float              |
| vector double             | vector double             | vector double             |

#### Result value

The value of each element of the result is the sum of the corresponding elements of a and b. For integer vectors and unsigned vectors, the arithmetic is modular.
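What "modular" means for the integer case can be shown with a scalar stand-in. The function `add_bytes_modular` is a hypothetical model of `vec_add` on 16 unsigned bytes, not the intrinsic itself: every lane wraps around modulo 256 and never carries into its neighbour.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of an element-wise add on 16 unsigned bytes:
 * each lane wraps modulo 256, independently of all other lanes. */
void add_bytes_modular(const uint8_t a[16], const uint8_t b[16], uint8_t d[16])
{
    assert(a && b && d);
    for (int i = 0; i < 16; i++) {
        d[i] = (uint8_t)(a[i] + b[i]);   /* 250 + 10 becomes 4 */
    }
}
```

If clamping at the maximum value is wanted instead of wrap-around, AltiVec offers the saturating variant `vec_adds`.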




- `vec_add(a, b)`: Add a and b element-wise
- `vec_sub(a, b)`: Subtract b from a element-wise
- `vec_mul(a, b)`: Multiply a and b element-wise (gcc: float only)
- `vec_madd(a, b, c)`: Multiply a and b element-wise and add the elements of c
- `vec_min(a, b)`: Select the element-wise minimum of a and b
- `vec_re(a)`: Compute reciprocal estimates of the elements
- `vec_sqrt(a)`: Calculate the square roots of the elements
- `vec_sr(a, b)`: Right-shift each element of a by the number of bits given in the low-order bits of the corresponding element of b



# Conversion of Floating-Point Types

Idea behind this: **Fixed-point numbers** of n digits. For just plain conversion use n = 0.
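The fixed-point idea can be sketched for a single lane in plain C. This assumes the AltiVec conversion intrinsics `vec_ctf(a, n)` (integer to float, result divided by 2^n) and `vec_cts(a, n)` (float multiplied by 2^n, then converted to a signed integer); the helper names `fixed_to_float` and `float_to_fixed` are invented for the illustration and model one element only.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of a vec_ctf-style conversion: interpret `fixed` as a
 * fixed-point number with n fractional bits, i.e. fixed / 2^n. */
float fixed_to_float(int32_t fixed, unsigned n)
{
    assert(n < 32);
    return (float)fixed / (float)(1u << n);
}

/* One lane of a vec_cts-style conversion: x * 2^n, truncated toward zero.
 * The hardware instruction additionally saturates on overflow; this
 * sketch leaves that out and expects the result to fit into 32 bits. */
int32_t float_to_fixed(float x, unsigned n)
{
    assert(n < 32);
    return (int32_t)(x * (float)(1u << n));
}
```

With n = 0 both helpers reduce to an ordinary cast between `int32_t` and `float`.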


# Vector Data Realignment and Permutation (1)

Sometimes memory is not ordered correctly for a certain task.

Example: Squared absolute of 2D points ( $r^2 = p_x^2 + p_y^2$ )







`res = vec_perm(a, b, pattern)`

Rearranges the bytes of two vectors according to the provided pattern.

Each byte of `pattern` is an index into the 32-byte array formed by concatenating a and b.
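A scalar model makes the byte selection concrete. The code below is a sketch, not the intrinsic: `perm_bytes` and `gather_coordinate` are hypothetical names, the bytes are taken in memory order (so the model behaves the same on any host), and the y pattern is simply the x pattern shifted by four bytes, which is an assumption about how the 2D point example would continue.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of res = vec_perm(a, b, pattern): each pattern byte
 * selects one byte of the 32-byte concatenation of a and b. */
void perm_bytes(const uint8_t a[16], const uint8_t b[16],
                const uint8_t pattern[16], uint8_t res[16])
{
    uint8_t both[32];
    memcpy(both, a, 16);
    memcpy(both + 16, b, 16);
    for (int i = 0; i < 16; i++) {
        res[i] = both[pattern[i] & 0x1f];   /* only the low 5 bits count */
    }
}

/* Gather the x (or y) coordinates of four interleaved 2D float points. */
void gather_coordinate(const float points[8], int want_y, float out[4])
{
    /* Byte indices of X0..X3; adding 4 moves on to Y0..Y3. */
    static const uint8_t patx[16] = {0, 1, 2, 3, 8, 9, 10, 11,
                                     16, 17, 18, 19, 24, 25, 26, 27};
    uint8_t a[16], b[16], pattern[16], res[16];
    assert(sizeof(float) == 4);
    memcpy(a, points, 16);        /* first two points  */
    memcpy(b, points + 4, 16);    /* second two points */
    for (int i = 0; i < 16; i++) {
        pattern[i] = (uint8_t)(patx[i] + (want_y ? 4 : 0));
    }
    perm_bytes(a, b, pattern, res);
    memcpy(out, res, 16);
}
```

For the points (1, 10), (2, 20), (3, 30), (4, 40) the x gather yields 1, 2, 3, 4 and the y gather 10, 20, 30, 40.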



# Vector Bit Selection (1)

Sometimes two vectors should be combined, but their bytes not moved.

Example: Every even element of a vector should be rounded up, and every odd one rounded down.

`vector float a = vec_ceil(X);`

`vector float b = vec_floor(X);`

`vector unsigned int pattern = {0, 0xffffffff, 0, 0xffffffff};`

`vector float res = vec_sel(a, b, pattern);`







# Vector Bit Selection (2)



`res = vec_sel(a, b, pattern)`

**Bit-wise** pick contents from a or b, depending on whether the corresponding bit in pattern is 0 or 1.

`res[bit i] = a[bit i] if pattern[bit i] == 0 else b[bit i]`
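For one 32-bit lane the rule boils down to two ANDs and an OR. The helpers `sel32` and `abs_select` are hypothetical names for this sketch; the second one shows how an all-ones or all-zeros mask, as a vector compare would produce it, turns the select into a choice between two precomputed values.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-wise select on one 32-bit lane: take the bit from a where the
 * pattern bit is 0, and from b where it is 1. */
uint32_t sel32(uint32_t a, uint32_t b, uint32_t pattern)
{
    return (a & ~pattern) | (b & pattern);
}

/* Absolute value via select: compute x and -x, then pick per mask. */
int32_t abs_select(int32_t x)
{
    assert(x != INT32_MIN);                       /* -x would not fit */
    uint32_t pos  = (uint32_t)x;
    uint32_t neg  = 0u - (uint32_t)x;             /* -x in two's complement */
    uint32_t mask = (x < 0) ? 0xffffffffu : 0u;   /* stands in for a vector compare */
    return (int32_t)sel32(pos, neg, mask);
}
```

Because the selection works on bits, a mask that is not uniform within an element mixes bits of both inputs, for example `sel32(0xAAAAAAAA, 0x55555555, 0x0000FFFF)` gives `0xAAAA5555`.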

# Conditional Programming (1)

There are **no branches** for element computation in AltiVec. Instead compute both variants and then use **bit-wise select**.




# Conditional Programming (2)

## Remember the vector types?

| Types | Elements of the `vector bool` type |
|-------|------------------------------------|
| `vector unsigned char`, `vector signed char`, `vector bool char` | 16x false (= 0x0) or true (= 0xff) |
| `vector unsigned short`, `vector signed short`, `vector bool short`, `vector pixel` | 8x false (= 0x0) or true (= 0xffff) |
| `vector unsigned int`, `vector signed int`, `vector bool int`, `vector float` | 4x false (= 0x0) or true (= 0xffffffff) |



# Conditional Programming (3)

`vector bool int res = vec_cmpgt(a, b);`

| Intrinsic   | Comparison                          |
|-------------|-------------------------------------|
| `vec_cmpgt` | `>`                                 |
| `vec_cmpge` | `>=` (for gcc on floats only)       |
| `vec_cmpeq` | `==`                                |
| `vec_cmple` | `<=` (for gcc on **floats only**)   |
| `vec_cmplt` | `<`                                 |

Logic operations for combining the results:

- `vec_and`: `(a & b)`
- `vec_or`: `(a | b)`
- `vec_nand`: `~(a & b)`
- `vec_orc`: `(a | ~b)`
- ...



```
vector signed int calc_abs(vector signed int a)
{
    vector signed int vzero = {0, 0, 0, 0};
    vector signed int neg_a = vec_sub(vzero, a);   /* -a for every element */
    vector bool int vpat = vec_cmpgt(vzero, a);    /* true where a < 0 */
    return vec_sel(a, neg_a, vpat);                /* -a where a < 0, else a */
}
```

Why not simply use `vec_abs(a)`?





```
void scale(float *input, int num, float scale)
{
    int i;
    for (i = 0; i < num; i++) {
        input[i] *= scale;
    }
}
```

# Learning by example



```
void scale(float *input, int num, float scale)
{
    int i;
    vector float vscale = {scale, scale, scale, scale};
    for (i = 0; i < num; i += 4) {
        vector float *current = ((vector float *)&input[i]);
        *current = vec_mul(vscale, *current);
    }
}
```

<Do you see a problem?>

# Scale an Array by Factor (Vector, Safe)

```
void scale(float *input, int num, float scale)
{
    int i;
    vector float vscale = {scale, scale, scale, scale};
    /* process all complete groups of four elements */
    for (i = 0; i <= num - 4; i += 4) {
        vector float *current = ((vector float *)&input[i]);
        *current = vec_mul(vscale, *current);
    }
    /* handle the remaining 0 to 3 elements one by one */
    for (; i < num; i++) {
        input[i] = scale * input[i];
    }
}
```

```
void scale(float *input, int num, float scale)
{
    int i;
    vector float vscale = {scale, scale, scale, scale};
    vector float *vinput = (vector float *)input;
    /* num / 4 complete vectors */
    for (i = 0; i < num / 4; i++) {
        vinput[i] = vec_mul(vscale, vinput[i]);
    }
    /* remaining scalar elements */
    for (i = (num / 4) * 4; i < num; i++) {
        input[i] = scale * input[i];
    }
}
```



# Squared Absolute of Points (1)

```
struct point2d {
    float x, y;
};
```




# Squared Absolute of Points (2) – Permute Bytes to Get X



| Source vector | va  | va  | va   | va    | vb    | vb    | vb    | vb    |
|---------------|-----|-----|------|-------|-------|-------|-------|-------|
| Element       | X0  | Y0  | X1   | Y1    | X2    | Y2    | X3    | Y3    |
| Byte indices  | 0-3 | 4-7 | 8-11 | 12-15 | 16-19 | 20-23 | 24-27 | 28-31 |

`patx = {0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27}`

`vx = vec_perm(va, vb, patx)`

| vx           | X0  | X1  | X2   | X3    |
|--------------|-----|-----|------|-------|
| Byte indices | 0-3 | 4-7 | 8-11 | 12-15 |

# Squared Absolute of Points (2) – Permute Bytes to Get Y







<Any endianness issues here?>

Rule of thumb: If neither the element size changes nor the data is stored for use on another platform, there are no endianness issues!

```
/* patx and paty are the permute patterns gathering the x and y coordinates */
int i;
vector float *vinput = (vector float *)input;
vector float *voutput = (vector float *)output;
/* four points (two input vectors) per iteration */
for (i = 0; i < num / 4; i++) {
    vector float va = vinput[2 * i];
    vector float vb = vinput[2 * i + 1];
    vector float vx = vec_perm(va, vb, patx);
    vector float vy = vec_perm(va, vb, paty);
    voutput[i] = vec_add(vec_mul(vx, vx), vec_mul(vy, vy));
}
/* remaining points */
for (i = 4 * (num / 4); i < num; i++) {
    output[i] = input[i].x * input[i].x
                + input[i].y * input[i].y;
}
```



# Short overview of SS[S]E[2,3,4]/AVX[-2,-512]


# Vector registers on Intel architectures

Overlapping register files for each ISA extension. With AVX-512 extended to 32 registers.

New C data types:

| Type      | Contents                                |
|-----------|-----------------------------------------|
| `__m128`  | 4 floats                                |
| `__m128d` | 2 doubles                               |
| `__m128i` | multiple (un)signed integers (8-128bit) |
| `__m256`  | 8 floats                                |
| `__m256d` | 4 doubles                               |
| `__m256i` | multiple (un)signed integers (8-128bit) |
| `__m512`  | ...                                     |

[Figure: AVX-512 register scheme as extension from the AVX (YMM0-YMM15) and SSE (XMM0-XMM15) registers. Each register ZMMn spans bits 511-0; its lower 256 bits (255-0) form YMMn, and its lower 128 bits (127-0) form XMMn.]


Instructions typically use input registers as output: `mulps r0, r1` ::= `r0 *= r1`

# Intrinsic function name patterns (ICC/GCC/MSVC)

`#include <x86intrin.h>` or `#include <[version]mmintrin.h>`

Dedicated intrinsic names for data types (mirrors instructions):

`_mm[result_bit_width]_<name>_<data_type>` (the bit width is skipped for 128 bit (SSE))

- `ps`: vectors contain floats (packed single-precision)
- `pd`: vectors contain doubles (packed double-precision)
- `epi8`/`epi16`/`epi32`/`epi64`: vectors contain 8-bit/16-bit/32-bit/64-bit signed integers
- `epu8`/`epu16`/`epu32`/`epu64`: vectors contain 8-bit/16-bit/32-bit/64-bit unsigned integers
- `si128`/`si256`: unspecified 128-bit vector or 256-bit vector [e.g. loads]
- `m128`/`m128i`/`m128d`/`m256`/`m256i`/`m256d`: identifies input vector types, when different from the type of the returned vector
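The naming scheme is easiest to read off real calls. The sketch below assumes an x86-64 target, where SSE2 is part of the baseline and needs no extra compiler flag; the wrapper names `add4_floats` and `add4_ints` are made up, while the intrinsics themselves are the standard ones from `<immintrin.h>`.

```c
#include <assert.h>
#include <immintrin.h>

/* _mm_add_ps: no bit width in the prefix (128-bit SSE), name "add",
 * data type "ps" (packed single-precision floats). */
void add4_floats(const float *a, const float *b, float *out)
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);      /* "u": unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* _mm_add_epi32: same name, data type "epi32" (packed 32-bit integers);
 * "si128" marks an untyped 128-bit load or store. */
void add4_ints(const int *a, const int *b, int *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

The 256-bit counterparts follow the same pattern with the width spelled out, for example `_mm256_add_ps`, but require AVX support at compile time and run time.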

# Loading and Storing Memory



Memory loads require vector aligned addresses:

Values, again, can be cast to native pointers to be used for storing:

```
int *output = (int *)&vec;
__m256 *dst = (__m256 *)aligned_buffer;
dst[0] = vec;
```


```
_mm256_store[u]_ps(dst, vec);
```

# Scalar operations in vector registers





| 127-96 | 95-64 | 63-32 | 31-0 |
|--------|-------|-------|------|
| 4.0  | 3.0  | 2.0  | 1.0  |
| *    | *    | *    | *    |
| 5.0  | 5.0  | 5.0  | 5.0  |
| =    | =    | =    | =    |
| 20.0 | 15.0 | 10.0 | 5.0  |






# Intel Intrinsics Guide

[Screenshot: Intel Intrinsics Guide, entry for `_mm256_add_ps`]

- Synopsis: `__m256 _mm256_add_ps (__m256 a, __m256 b)`, `#include <immintrin.h>`
- Instruction: `vaddps ymm, ymm, ymm`
- CPUID Flags: AVX
- Description: Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
- Operation: `FOR j := 0 to 7; i := j*32; dst[i+31:i] := a[i+31:i] + b[i+31:i]; ENDFOR; dst[MAX:256] := 0`

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 4       | 0.5              |
| Broadwell    | 3       | 1                |
| Haswell      | 3       | 1                |
| Ivy Bridge   | 3       | 1                |

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#



# **Autovectorization**


# Enable Autovectorization and Logging (GCC)



`-ftree-vectorize -m<arch>`

enable automatic code vectorization (part of `-O3`)

| Flag                          | Effect                                                          |
|-------------------------------|-----------------------------------------------------------------|
| `-fopt-info-vec[-optimized]`  | log loops optimized                                             |
| `-fopt-info-vec-missed`       | log loops that failed to be optimized, with detailed information |
| `-fopt-info-vec-note`         | verbose info on loops and optimizations done                    |
| `-fopt-info-vec-all`          | enable all above                                                |

```
example4.c:14:10: optimized: loop vectorized using 16 byte vectors
example4.c:9:6: note: vectorized 1 loops in function.

autovector.cpp:22:22: missed: couldn't vectorize loop
autovector.cpp:25:14: missed: not vectorized: complicated access pattern.
```

# What loops can be vectorized



- Countable loops
- Static counts (length does not change)
- Single entry and single exit (read: no data-dependent break)
- All function calls can be in-lined, or are math intrinsics (sin, floor, ...)
- Straight-line code (no switch-statements), mask-able if/continue

```
for (int i=0; i<length; i++) {
    float s = b[i]*b[i] - 4*a[i]*c[i];
    if ( s >= 0 ) {
        s = sqrt(s) ;
        x2[i] = (-b[i]+s)/(2.*a[i]);
        x1[i] = (-b[i]-s)/(2.*a[i]);
    } else {
        x2[i] = 0.;
        x1[i] = 0.;
    }
}
```


# What cannot be vectorized



- Non-contiguous memory accesses (often in nested loops)
  - `for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[i];`
  - `for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[index[i]];`
- Data dependencies within vector length
  - `x[i] = x[i-1]*2;` (read-after-write)
  - `x[i-1] = x[i]*2;` (write-after-read)
  - Except: `sum = sum + x[j] * y[j]` (reduction)
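Why the read-after-write case is a problem can be seen by imitating a vectorized loop by hand. The sketch is purely illustrative: `doubling_naive_vector` is a hypothetical model of what a 4-wide vector loop would do (load a whole block, then store a whole block), not what any compiler emits, and the array length `N` is chosen so that whole blocks fit.

```c
#include <assert.h>

#define N 9

/* Sequential semantics: every iteration sees the value written
 * by the previous one (read-after-write dependency). */
void doubling_scalar(int x[N])
{
    for (int i = 1; i < N; i++) {
        x[i] = x[i - 1] * 2;
    }
}

/* Model of a naive 4-wide vectorization: all four inputs of a block
 * are loaded before any result of that block is stored. */
void doubling_naive_vector(int x[N])
{
    assert((N - 1) % 4 == 0);   /* sketch handles whole blocks only */
    for (int i = 1; i + 4 <= N; i += 4) {
        int in[4];
        for (int k = 0; k < 4; k++) in[k] = x[i - 1 + k];   /* "vector load"  */
        for (int k = 0; k < 4; k++) x[i + k] = in[k] * 2;   /* "vector store" */
    }
}
```

Starting from an array of ones, the scalar loop produces powers of two up to 256, while the block-wise model mostly doubles stale values. A compiler therefore has to refuse to vectorize such a loop.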

https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf

https://software.intel.com/en-us/articles/common-vectorization-tips





<Do you see a problem?>

What happens if a, b, or c overlap? What if any of them is not aligned?
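The overlap question can be made concrete with a small sketch. The function names and signatures are assumptions for this illustration: `add_arrays` uses the C99 `restrict` qualifier to promise the compiler that the arrays do not overlap, while `add_arrays_may_alias` makes no such promise and must keep strict element-by-element order.

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises that c does not overlap a or b, so the compiler
 * may load several elements before storing any result. */
void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Without the promise, overlap is legal. Called with c == a + 1, every
 * store feeds the next iteration's load, which rules out naive
 * vectorization or forces a run-time overlap check. */
void add_arrays_may_alias(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Calling `add_arrays_may_alias(x, ones, x + 1, 4)` on `x = {1, 0, 0, 0, 0}` turns the loop into a running sum. For the alignment question, GCC and Clang offer `__builtin_assume_aligned` as a way to pass on alignment guarantees; without it the compiler typically falls back to unaligned accesses or peels leading iterations.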




Digital Engineering • Universität Potsdam

# And now for a break and a cup of Ceylon with milk\*.

