

# VECTORIZATION FOR INTEL® C++ & FORTRAN COMPILER

#### Agenda

- Introduction to SIMD for Intel<sup>®</sup> Architecture
- Vector Code Generation
- Compiler & Vectorization
- Validating Vectorization Success
- Reasons for Vectorization Fails
- Summary


#### Vectorization

- Single Instruction Multiple Data (SIMD):
  - Processing vector with a single operation
  - Provides data level parallelism (DLP)
  - More efficient than scalar processing because of DLP
- Vector:
  - Consists of more than one element
  - Elements are of same scalar data types (e.g. floats, integers, ...)
- Vector length (VL): Number of elements in the vector
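As a rough illustration of vector length (a sketch only; the function names and the value `VL = 4` are assumptions, not taken from the slides), the same element-wise addition can be written one element at a time or grouped into chunks of VL elements. The chunked shape is roughly what a vectorizing compiler produces, with each inner chunk standing in for one SIMD instruction:

```c
#include <stddef.h>

#define VL 4  /* assumed vector length: four floats, as in a 128 bit register */

/* Scalar form: one element per iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same work grouped into chunks of VL elements. Each inner loop models what
   a single SIMD instruction does at once; the last loop handles n % VL. */
void add_chunked(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (size_t lane = 0; lane < VL; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; i++)  /* remainder loop */
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; only the grouping of the work differs.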



#### **Optimization Notice**

Copyright © 2016, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.

### SIMD & Intel<sup>®</sup> Architecture

#### • SIMD instructions:

- One single machine instruction for vector processing
- Vector lengths are fixed (2, 4, 8, 16)
- Synchronous execution on elements of vector(s)
   ⇒ Results are available at the same time
- Masking possible to omit operations on selected elements
- SIMD has been key to data level parallelism for years:
  - 64 bit Multi-Media Extension (MMX<sup>™</sup>)
  - 128 bit Intel<sup>®</sup> Streaming SIMD Extensions (Intel<sup>®</sup> SSE, SSE2, SSE3, SSE4.1, SSE4.2) and Supplemental Streaming SIMD Extensions (SSSE3)
  - 256 bit Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX)
  - 512 bit vector instruction set extension of Intel<sup>®</sup> Many Integrated Core Architecture (Intel<sup>®</sup> MIC Architecture) and Intel<sup>®</sup> Advanced Vector Extensions 512 (Intel<sup>®</sup> AVX-512)

#### **SSE Vector Types**




#### SSE Packed vs. Scalar

- **Packed** SSE instructions operate on all elements per vector
- Most of these instructions have **scalar** versions operating only on one element of vector
- Avoid scalar versions and only use packed instructions to exploit SIMD capabilities!
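The difference between packed and scalar instructions can be sketched with SSE intrinsics. This is an illustrative example for x86 targets only; the wrapper function names are assumptions, while `_mm_add_ps` and `_mm_add_ss` are the standard SSE intrinsics for the packed and scalar single-precision add:

```c
#include <xmmintrin.h>  /* SSE intrinsics (x86 only) */

/* Packed add: all four lanes are summed, out[k] = a[k] + b[k]. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar add: only lane 0 is summed; lanes 1..3 are passed through from a. */
void add_scalar_lane(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

With the scalar version, three quarters of the register width do no useful arithmetic.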




#### **AVX Vector Types**




# Intel<sup>®</sup> MIC Architecture Vector Types



- High level language *complex* types can also be used; the compiler takes care of the details (this halves the potential vector length)
- Use 32 bit integers where possible, avoid 64 bit integers (short & char types will be converted implicitly, though)
- Masking supported via dedicated registers (K0-7)
   ⇒ No need for bit vectors or additional compute cycles

### Intel® AVX-512 Vector Types



#### ⇒ Combines AVX and Intel<sup>®</sup> MIC Architecture!


### Intel<sup>®</sup> AVX-512 Registers

- Extended VEX encoding (EVEX) to introduce another prefix
- Extends previous AVX and SSE registers to 512 bit:



#### ⇒ No penalty when switching between XMM, YMM and ZMM!


# Intel® AVX-512 - Comparison

- KNL and future Intel<sup>®</sup> Xeon<sup>®</sup> processors share a large set of instructions (the common instruction set)
- But some sets are not identical
- Subsets are represented by individual feature flags (CPUID)

| Processor | SIMD instruction sets |
|---|---|
| NHM | SSE |
| SNB | SSE, AVX |
| HSW | SSE, AVX, AVX2 |
| Future Knights Landing (KNL) | SSE, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512ER, AVX-512PF |
| Future Intel<sup>®</sup> Xeon<sup>®</sup> processor | SSE, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, AVX-512VL, MPX, SHA, ... |

# **Operating Systems & Intel® AVX-512**

- OS support is required due to the new (extended) register state
- At least the following OSes are needed to get Intel<sup>®</sup> AVX-512:
  - Linux\* kernel 3.15 or later
  - Microsoft Windows\* 8 and later
  - OS X\*: unknown

#### Without OS support Intel<sup>®</sup> AVX-512 cannot be used even though the underlying processor supports it!




### Vectorization of Code

- Transform sequential code to exploit vector processing capabilities (SIMD) of Intel processors
  - Manually by explicit syntax
  - Automatically by tools like a compiler




#### **Use Vectorization**

- How to express vectorization?
  - Fortran and C/C++ have limited ways to express it
  - But, Intel compilers use heuristics to vectorize
  - There are extensions that allow expression of vectorization explicitly
  - There are other, less portable ways...
- Select SIMD type:
  - A specific SSE/AVX version also includes all previous versions
  - Prefer AVX to SSE if available and possible; AVX also includes SSE
  - Avoid mixing SSE and AVX when using intrinsics or direct assembly
  - If the target platform is not fixed/known, the Intel compiler can help by producing multiple versions for different SIMD types:
     Runtime processor dispatching
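What runtime processor dispatching amounts to can be sketched by hand. This is only an illustration of the idea, not how the Intel compiler implements it: all function names are hypothetical, and the feature check uses the GCC/Clang builtin `__builtin_cpu_supports`, which is an assumption about the toolchain (other compilers fall back to the baseline path here):

```c
#include <stddef.h>

typedef void (*scale_fn)(float *x, size_t n, float s);

/* Baseline path: would be built for the lowest SIMD level to support. */
static void scale_baseline(float *x, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}

/* Stand-in for a path built for a higher SIMD level (e.g. AVX).
   The source is identical; only the compiler flags would differ. */
static void scale_avx(float *x, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}

/* Nonzero if the host reports AVX (GCC/Clang builtin on x86 only). */
static int host_has_avx(void)
{
#if defined(__GNUC__) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx");
#else
    return 0;
#endif
}

/* Pick the code path once, at run time. */
scale_fn select_scale(void)
{
    return host_has_avx() ? scale_avx : scale_baseline;
}
```

A caller would obtain the function pointer once via `select_scale()` and then use it for all subsequent calls.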

### Many Ways to Vectorize

Listed from greatest ease of use (top) to greatest programmer control (bottom):

- Compiler: Auto-vectorization (no change of code)
- Compiler: Auto-vectorization hints (#pragma vector, ...)
- Compiler: OpenMP\* 4.0 and Intel<sup>®</sup> Cilk<sup>™</sup> Plus
- SIMD intrinsic class (e.g.: F32vec, F64vec, ...)
- Vector intrinsic (e.g.: \_mm\_fmadd\_pd(...), \_mm\_add\_ps(...), ...)
- Assembler code (e.g.: [v]addps, [v]addss, ...)



#### Auto-vectorization of Intel Compilers




### Vectorization with Language Extensions

- Using vector intrinsic or SIMD intrinsic class inherently provides a guarantee of using SIMD instructions
   However, this is highly platform dependent and complex
- Auto-vectorization mostly works out of the box
   However, there are cases where auto-vectorization does not work
- Intel<sup>®</sup> Cilk<sup>™</sup> Plus Array Notation Extensions can alleviate both problems (C/C++ only):
  - Deterministically making use of vectorization



OpenMP\* 4.0 also provides extensions for vectorization (C/C++ & Fortran)
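A minimal sketch of the OpenMP\* 4.0 approach, assuming a hypothetical dot-product function: the `simd` construct asks the compiler to vectorize the loop, and the `reduction` clause tells it how to combine the per-lane partial sums. Compilers without OpenMP SIMD support enabled simply ignore the pragma and run the loop as scalar code.

```c
#include <stddef.h>

/* Dot product; the pragma requests vectorization with a sum reduction. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```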


#### **Basic Vectorization Switches I**

- Linux\*, OS X\*: -x<feature>, Windows\*: /Qx<feature>
  - Might enable Intel processor specific optimizations
  - Processor-check added to "main" routine: The application exits with an appropriate/informative message if the SIMD feature is missing or the processor is not an Intel processor
- Linux\*, OS X\*: -ax<features>, Windows\*: /Qax<features>
  - Multiple code paths: baseline and optimized/processor-specific
  - Optimized code paths for Intel processors defined by <features>
  - Multiple SIMD features/paths possible, e.g.: -axSSE2,AVX
  - Baseline code path defaults to -msse2 (/arch:sse2)
  - The baseline code path can be modified by -m<feature> or -x<feature> (/arch:<feature> or /Qx<feature>)

#### **Basic Vectorization Switches II**

- Linux\*, OS X\*: -m<feature>, Windows\*: /arch:<feature>
  - Neither check nor specific optimizations for Intel processors: Application optimized for both Intel and non-Intel processors for selected SIMD feature
  - Missing check can cause application to fail in case extension not available
- Default for Linux\*: -msse2, Windows\*: /arch:sse2:
  - Activated implicitly
  - Implies the need for a target processor with at least Intel<sup>®</sup> SSE2
- Default for OS X\*: -msse3 (IA-32), -mssse3 (Intel<sup>®</sup> 64)
- For 32 bit compilation, **-mia32** (**/arch:ia32**) can be used in case target processor does not support Intel<sup>®</sup> SSE2 (e.g. Intel<sup>®</sup> Pentium<sup>®</sup> 3 or older)

#### **Basic Vectorization Switches III**

- Special switch for Linux\*, OS X\*: -xHost, Windows\*: /QxHost
  - Compiler checks SIMD features of current host processor (where built on) and makes use of latest SIMD feature available
  - Code only executes on processors with same SIMD feature or later as on build host
  - As for -x<feature> or /Qx<feature>, if "main" routine is built with -xHost or /QxHost the final executable only runs on Intel processors

#### **Vectorization Pragma/Directive**

- SIMD features can also be set on a function/subroutine level via pragmas/directives:
  - C/C++: #pragma intel optimization\_parameter target\_arch=<CPU>
  - Fortran:
     !DIR\$ ATTRIBUTES OPTIMIZATION\_PARAMETER:TARGET\_ARCH= <CPU>
- Examples:



### **Control Vectorization I**

- Disable vectorization:
  - Globally via switch: Linux\*, OS X\*: -no-vec, Windows\*: /Qvec-
  - For a single loop:
     C/C++: #pragma novector, Fortran: !DIR\$ NOVECTOR
  - Compiler still can use some SIMD features
- Using vectorization:
  - Globally via switch (default for optimization level 2 and higher): Linux\*, OS X\*: -vec, Windows\*: /Qvec
  - Enforce for a single loop (override compiler efficiency heuristic) if semantically correct:
     C/C++: #pragma vector always, Fortran: !DIR\$ VECTOR ALWAYS
  - Influence efficiency heuristics threshold: Linux\*, OS X\*: -vec-threshold[n]
     Windows\*: /Qvec-threshold[[:]n]
     n: 100 (default; only if profitable) ... 0 (always)
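The two per-loop pragmas above can be sketched as follows (function names are hypothetical). Both pragmas are Intel compiler directives; other compilers ignore them, so the code behaves the same either way:

```c
/* Running sum: each iteration needs the previous result, so this loop is
   explicitly excluded from vectorization. */
void running_sum(float *x, int n)
{
    #pragma novector
    for (int i = 1; i < n; i++)
        x[i] += x[i - 1];
}

/* Ask the compiler to vectorize even if its efficiency heuristic says no.
   This does not override dependence analysis: the loop must still be
   semantically safe to vectorize. */
void scale_add(float *y, const float *x, float s, int n)
{
    #pragma vector always
    for (int i = 0; i < n; i++)
        y[i] += s * x[i];
}
```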



### **Control Vectorization II**

- Verify vectorization:
  - Globally: Linux\*, OS X\*: -opt-report, Windows\*: /Qopt-report
  - Abort compilation if loop cannot be vectorized:
     C/C++: #pragma vector always assert
     Fortran: !DIR\$ VECTOR ALWAYS ASSERT
- Advanced:
  - Ignore vector dependencies (IVDEP): C/C++: #pragma ivdep Fortran: !DIR\$ IVDEP
  - "Enforce" vectorization: C/C++: #pragma simd or #pragma omp simd Fortran: !DIR\$ SIMD or !\$OMP SIMD

When used, vectorization can only be turned off with: Linux\*, OS X\*: -no-vec -no-simd -qno-openmp-simd Windows\*: /Qvec- /Qsimd- /Qopenmp-simd-
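As a sketch of the IVDEP hint (the function and its contract are assumptions for illustration): the compiler cannot know the value of `m` at compile time and has to assume a dependence between `a[i + m]` and `a[i]`. If the caller guarantees that `m` is at least as large as the vector length, the assumed dependence can be ignored:

```c
/* Caller must guarantee m >= vector length (or that the written and read
   ranges do not overlap at all); otherwise the hint is wrong and the
   vectorized loop may compute different results. */
void shift_add(float *a, const float *b, int n, int m)
{
    #pragma ivdep
    for (int i = 0; i < n; i++)
        a[i + m] = a[i] + b[i];
}
```

The pragma only removes assumed dependences; a dependence the compiler can prove is still honored.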




# Validating Vectorization Success I

- Assembler code inspection (Linux\*, OS X\*: -S, Windows\*: /Fa):
  - Most reliable way and gives all details of course
  - Check for scalar/packed or (E)VEX encoded instructions: Assembler listing contains source line numbers for easier navigation
- Using Intel<sup>®</sup> VTune<sup>™</sup> Amplifier:
  - Different events can be selected to measure use of vector units, e.g.
     FP\_COMP\_OPS\_EXE.SSE\_PACKED\_[SINGLE|DOUBLE]
  - For Intel<sup>®</sup> MIC Architecture: Use metric Vectorization Intensity

#### • Difference method:

1. Compile and benchmark with -no-vec -no-simd -qno-openmp-simd or /Qvec- /Qsimd- /Qopenmp-simd-, or on a loop by loop basis via #pragma novector or !DIR\$ NOVECTOR
2. Compile and benchmark with selected SIMD feature
3. Compare runtime differences

# Validating Vectorization Success II

#### Optimization report:

- Linux\*, OS X\*: -opt-report=<n>, Windows\*: /Qopt-report:<n>
   n: 0, ..., 5 specifies level of detail; 2 is default (more later)
- Prints optimization report with vectorization analysis
- Also known as vectorization report for Intel<sup>®</sup> C++/Fortran Compiler before 15.0: Linux\*, OS X\*: -vec-report=<n>, Windows\*: /Qvec-report:<n>
   Deprecated, don't use anymore – use optimization report instead!
- Optimization report phase:
  - Linux\*, OS X\*: -opt-report-phase=<phase>, Windows\*: /Qopt-report-phase:<phase>
  - <phase> is all by default; use vec for just the vectorization report
- Optimization report file:
  - Linux\*, OS X\*: -opt-report-file=<f>, Windows\*: /Qopt-report-file:<f>
  - <f> can be stderr, stdout or a file (default: \*.optrpt)


#### **Optimization Report Example**

Example novec.f90:

```
1: subroutine fd(y)
2: integer :: i
3: real, dimension(10), intent(inout) :: y
4: do i=2,10
5: y(i) = y(i-1) + 1
6: end do
7: end subroutine fd
```

```
$ ifort novec.f90 -opt-report=5
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
$ cat novec.optrpt
...
LOOP BEGIN at novec.f90(4,5)
    remark #15344: loop was not vectorized: vector dependence prevents vectorization
    remark #15346: vector dependence: assumed FLOW dependence between y line 5 and y line 5
    remark #25436: completely unrolled by 9
LOOP END
...
```

### **Optimization Report – Advanced I**

• See which levels are available for each phase:

Linux\*, OS X\*: -qopt-report-help, Windows\*: /Qopt-report-help

| <pre>\$ icpc -qopt-report-help</pre>                          |
|---------------------------------------------------------------|
| vec: Vector optimizations                                     |
| Level 1: Report the loops that were vectorized.               |
| Level 2: Level 1 + report the loops that were not vectorized, |
| along with reason preventing vectorization.                   |
| Level 3: Level 2 + loop vectorization summary.                |
| Level 4: Level 3 + report verbose details for reasons loop    |
| was/wasn't vectorized.                                        |
| Level 5: Level 4 + report information about variable/memory   |
| dependencies preventing vectorization.                        |
|                                                               |

#### • Select format:

- Linux\*, OS X\*: -qopt-report-format=[text|vs], Windows\*: /Qopt-report-format:[text|vs]
- text as textual and vs for Microsoft Visual Studio\* IDE integration output




# Reasons for Vectorization Fails I

#### Most frequent reasons:

- Data dependence
- Alignment
- Unsupported loop structure
- Non-unit stride access
- Function calls/in-lining
- Non-vectorizable mathematical functions
- Data types
- Control dependence
- Bit masking

#### All those are common and will be explained in detail next!




### Reasons for Vectorization Fails II

#### **Other reasons:**

- Outer loop of loop nesting cannot be vectorized
- Loop body too complex (register pressure)
- Vectorization seems inefficient (low trip count)
- Many more

Those are less likely and are not described in the following!

# Factors that prevent Vectorizing your code

1. Loop-carried dependencies

DO I = 1, N  
  A(I + M) = A(I) + B(I)  
ENDDO

1.A Pointer aliasing (compiler-specific)

2. Function calls (incl. indirect)

```
for (i = 1; i < nx; i++) {
    x = x0 + i * h;
    sumx = sumx + func(x, y, xp);
}
```

3. Loop structure, boundary condition

```
struct _x { int d; int bound; };
void doit(int *a, struct _x *x)
{
  for(int i = 0; i < x->bound; i++)
      a[i] = 0;
}
```

4. Outer vs. inner loops

```
for(i = 0; i <= MAX; i++) {
  for(j = 0; j <= MAX; j++) {
    D[j][i] += 1;
  }
}
```

5. Cost-benefit (compiler specific..)

And others...
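For item 4 above, one common remedy is loop interchange. The following is a sketch under assumptions: `MAX` is given an arbitrary value, the array is passed as a parameter, and the function names are hypothetical. In the original order the inner loop walks down a column of `D` (stride of one full row); after interchange the inner loop walks along a row (unit stride), which is the access pattern vectorization prefers:

```c
#define MAX 63  /* illustrative size, not from the slides */

/* Original order: inner loop index j selects the row, so consecutive
   inner iterations touch elements MAX + 1 ints apart. */
void increment_strided(int D[MAX + 1][MAX + 1])
{
    for (int i = 0; i <= MAX; i++)
        for (int j = 0; j <= MAX; j++)
            D[j][i] += 1;
}

/* Interchanged: inner loop index i selects the column, so consecutive
   inner iterations touch adjacent elements. */
void increment_interchanged(int D[MAX + 1][MAX + 1])
{
    for (int j = 0; j <= MAX; j++)
        for (int i = 0; i <= MAX; i++)
            D[j][i] += 1;
}
```

The interchange is legal here because every element is updated independently of all others.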





### Factors that slow-down your Vectorized code

1.A. Indirect memory access

1.B Memory sub-system Latency / Throughput

```
void scale(int *a, int *b)
{
    /* z and VERY_BIG are defined elsewhere */
    for (int i = 0; i < VERY_BIG; i++)
        b[i] = z * a[i];
}
```

2. Serialized or "sub-optimal" function calls

```
for (i = 1; i < nx; i++) {
    sumx = sumx + serialized_func_call(x, y, xp);
}
```

3. Small trip counts not multiple of VL



4. Branchy codes, outer vs. inner loops

5. MANY others: spill/fill, fp accuracy trade-offs, FMA, DIV/SQRT, Unrolling, even AVX throttling..



### Data Dependence

#### **Definition of data dependence:**

There is a data dependence from statement  $S_1$  to statement  $S_2$  (written as  $S_1 \delta S_2$ ) if and only if:

- There is a potential execution flow from S<sub>1</sub> to S<sub>2</sub>
- S<sub>1</sub> and S<sub>2</sub> reference a common memory location, and S<sub>1</sub> or S<sub>2</sub> writes to it

Note:  $S_1$  and  $S_2$  can be the very same statement

#### Data dependence classification:

- $S_1 \delta^F S_2$: $S_1$ writes, $S_2$ reads: **Flow Dependence**

$$\begin{array}{ll} S_1: & X = \dots \\ S_2: & \dots = X \end{array}$$

- $S_1 \delta^A S_2$: $S_1$ reads, $S_2$ writes: **Anti Dependence**

$$\begin{array}{ll} S_1: & \dots = X \\ S_2: & X = \dots \end{array}$$

- $S_1 \delta^O S_2$: $S_1$ writes, $S_2$ writes: **Output Dependence**

$$\begin{array}{ll} S_1: & X = \dots \\ S_2: & X = \dots \end{array}$$


### Data Dependence in Loops

Dependencies in loops become more obvious by virtually unrolling the loop:

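$$\begin{array}{ll} & \texttt{DO I = 1, N} \\ S_1: & \quad \texttt{A(I+1) = A(I) + B(I)} \\ & \texttt{ENDDO} \end{array}$$

$S_1 \delta^F S_1$

$$\begin{array}{ll} S_1: & A(2) = A(1) + B(1) \\ S_1: & A(3) = A(2) + B(2) \\ S_1: & A(4) = A(3) + B(3) \\ S_1: & A(5) = A(4) + B(4) \\ & \dots \end{array}$$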

In case the dependency requires execution of any previous loop iteration, we call it **loop-carried dependence**. Otherwise, **loop-independent dependence**.

E.g.:



 $S_1 \delta^F S_2$ : Loop-independent dependence

$$S_2 \delta^F S_2$$
: Loop-carried dependence
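A loop with exactly these two dependences could look like the following sketch. The arrays `a`, `b`, `x` and the function name are hypothetical and chosen only to illustrate the classification: S1 feeds S2 within the same iteration (loop-independent), while S2 reads the value that S2 wrote in the previous iteration (loop-carried).

```c
/* x must hold n + 1 elements. */
void two_dependences(float *a, const float *b, float *x, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = b[i] * 17.0f;     /* S1 */
        x[i + 1] = x[i] + a[i];  /* S2: reads a[i] from S1 (same iteration)
                                    and x[i] from S2 (previous iteration) */
    }
}
```

Only the loop-carried dependence on `x` stands in the way of vectorization; the loop-independent one is preserved by keeping the statement order.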


### **Disambiguation Hints I**

- Disambiguating memory locations of pointers in C99: Linux\*, OS X\*: -std=c99, Windows\*: /Qstd=c99
- Intel<sup>®</sup> C++ Compiler also allows this for other modes (e.g. -std=c89, -std=c++0x, ...), although this is not standardized:
   Linux\*, OS X\*: -restrict, Windows\*: /Qrestrict
- Declaring pointers with the keyword **restrict** asserts to the compiler that they only reference individually assigned, non-overlapping memory areas
- Also true for any result of pointer arithmetic (e.g. ptr + 1 or ptr[1])

Examples:

```
/* z and NUM are assumed to be defined elsewhere */
void scale(int *a, int *restrict b)
{
    for (int i = 0; i < 10000; i++)
        b[i] = z * a[i];
}

void mult(int a[][NUM], int b[restrict][NUM])
{ ... }
```


### **Disambiguation Hints II**

#### **Directives:**

- **#pragma ivdep** (C/C++) or **!DIR\$ IVDEP** (Fortran)
- #pragma simd (C/C++) or !DIR\$ SIMD (Fortran)

#### For C/C++:

- Assume no aliasing at all (dangerous!): Linux\*, OS X\*: -fno-alias, Windows\*: /Oa
- Assume ISO C Standard aliasing rules: Linux\*, OS X\*: -ansi-alias, Windows\*: /Qansi-alias
   Default with 15.0 and later but not with earlier versions!
- Turns on ANSI aliasing checker, too (thus recommended)
- No aliasing between function arguments: Linux\*, OS X\*: -fargument-noalias, Windows\*: /Qalias-args-
- No aliasing between function arguments and global storage: Linux\*, OS X\*: -fargument-noalias-global, Windows\*: N/A



### **Disambiguation Hints III**

#### For Fortran:

- Assume no aliasing at all: Linux\*, OS X\*: -fno-alias, Windows\*: /Oa
- Assume Fortran Standard aliasing rules: Linux\*, OS X\*: -ansi-alias, Windows\*: /Qansi-alias
   Unlike for C/C++, this has always been the default!
- No aliasing of Cray\* pointers: Linux\*, OS X\*: -safe-cray-ptr, Windows\*: /Qsafe-cray-ptr



### Inter-Procedural Dependency Analysis

- Optimization usually takes place individually for each procedure
- Dependency analysis of inter-procedural optimization (IPO) works across all procedures and thus allows **global optimization**
- Switch to turn on IPO for single file (one compilation unit)
  - Linux\*, OS X\*: -ip
  - Windows\*: /Qip
     Subset already default for optimization levels 2 and higher
- Switch to turn on IPO for all compilation units
  - Linux\*, OS X\*: -ipo
  - Windows\*: /Qipo
- Example: References of function arguments can be analyzed even if located in other compilation unit.



### Alignment

Caveat with using unaligned memory access:

- Unaligned loads and stores can be very slow due to higher I/O because two cache-lines need to be loaded/stored (not always, though)
- Compiler can mitigate expensive unaligned memory operations by using two partial loads/stores – still slow (e.g. two 64 bit loads instead of one 128 bit unaligned load)
- The compiler can use "versioning" in case alignment is unclear: Run time checks for alignment to use fast aligned operations if possible, the slower operations otherwise – better but limited

Best performance: User defined aligned memory

- 16 byte for SSE
- 32 byte for AVX
- 64 byte for Intel<sup>®</sup> MIC Architecture & Intel<sup>®</sup> AVX-512



### Alignment Hints for C/C++ I

- Aligned heap memory allocation by intrinsic/library call:
  - void\* \_mm\_malloc(int size, int base)
  - Linux\*, OS X\* only: int posix\_memalign(void \*\*p, size\_t base, size\_t size)
- #pragma vector [aligned|unaligned]
  - Only for Intel Compiler
  - Asserts to the compiler that aligned memory operations can be used for all data accesses in the loop following the directive
  - Use with care: The assertion must be satisfied for all(!) data accesses in the loop!
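The two hints above can be combined as in the following sketch. The helper names and the 64 byte alignment are assumptions for illustration; `posix_memalign` is the POSIX call named above, and `#pragma vector aligned` is honored only by the Intel compiler (other compilers ignore it):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* makes posix_memalign visible */
#endif
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* assumed: enough for SSE, AVX and AVX-512 */

/* Allocate n doubles on an ALIGNMENT byte boundary; NULL on failure.
   Release the memory with free(). */
double *alloc_aligned_doubles(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, ALIGNMENT, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}

/* Nonzero if p lies on an alignment byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* All three arrays must be aligned, otherwise the assertion made by the
   pragma is false and the program may fault. */
void mult(const double *a, const double *b, double *c, size_t n)
{
    #pragma vector aligned
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```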



### Alignment Hints for C/C++ II

- Align attribute for variable declarations:
  - Linux\*, OS X\*, Windows\*: \_\_declspec(align(base)) <var>
  - Linux\*, OS X\*: <var> \_\_attribute\_\_((aligned(base)))
  - Portability caveat: \_\_declspec is not known to GCC and \_\_attribute\_\_ is not known to Microsoft Visual Studio\*!
- Hint that start address of an array is aligned (Intel Compiler only): \_\_assume\_aligned(<array>, base)
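One way to cope with the portability caveat is a small macro, sketched below. The macro name `ALIGNED`, the array `table` and the functions are hypothetical; the `__assume_aligned` hint is only compiled when the Intel compiler is detected:

```c
#include <stdint.h>

/* Select the alignment syntax the compiler understands. */
#if defined(_MSC_VER)
#define ALIGNED(n) __declspec(align(n))
#else
#define ALIGNED(n) __attribute__((aligned(n)))
#endif

/* 32 byte alignment, as recommended for AVX. */
static ALIGNED(32) float table[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Nonzero if the array really starts on a 32 byte boundary. */
int table_is_aligned(void)
{
    return ((uintptr_t)table % 32) == 0;
}

float sum_table(void)
{
    const float *p = table;
#if defined(__INTEL_COMPILER)
    __assume_aligned(p, 32);  /* Intel compiler only: start address hint */
#endif
    float s = 0.0f;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}
```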



## **Alignment Hints for Fortran**

• !DIR\$ VECTOR [ALIGNED|UNALIGNED]

- Asserts to the compiler that aligned memory operations can be used for all data accesses in the loop following the directive
- Use with care: The assertion must be satisfied for all(!) data accesses in the loop!
- Hint that an entity in memory is aligned:
   !DIR\$ ASSUME\_ALIGNED address1:base [, address2:base] ...
- Align variables:
   !DIR\$ ATTRIBUTES ALIGN: base :: variable
- Align data items globally: Linux\*, OS X\*: -align <a>, Windows\*: /align:<a>
  - <a> can be array<n>byte with <n> defining the alignment for arrays
  - Other values for <a> are also possible, e.g.: [no] commons, [no] records, ...

### All are Intel<sup>®</sup> Fortran Compiler only directives and options!




### Alignment Impact: Example

#### Compiled both cases using **-xAVX**:

| Variant | Source | Loop body in the assembler listing |
|---|---|---|
| Unaligned | <pre>void mult(double* a, double* b, double* c)<br>{<br>    int i;<br>    #pragma vector unaligned<br>    for (i = 0; i &lt; N; i++)<br>        c[i] = a[i] * b[i];<br>}</pre> | <pre>vmovupd      (%rdi,%rax,8), %xmm0<br>…            (%rsi,%rax,8), %xmm1<br>vinsertf128  \$1, 16(%rsi,%rax,8), %ymm1, %ymm3<br>vinsertf128  \$1, 16(%rdi,%rax,8), %ymm0, %ymm2<br>vmulpd       %ymm3, %ymm2, %ymm4<br>…            %xmm4, (%rdx,%rax,8)<br>vextractf128 \$1, %ymm4, 16(%rdx,%rax,8)<br>addq         \$4, %rax<br>cmpq         \$1000000, %rax<br>…            ..B2.2</pre> |
| Aligned (more efficient) | <pre>void mult(double* a, double* b, double* c)<br>{<br>    int i;<br>    #pragma vector aligned<br>    for (i = 0; i &lt; N; i++)<br>        c[i] = a[i] * b[i];<br>}</pre> | <pre>…            (%rdi,%rax,8), %ymm0<br>vmulpd       (%rsi,%rax,8), %ymm0, %ymm1<br>…            %ymm1, (%rdx,%rax,8)<br>addq         \$4, %rax<br>cmpq         \$1000000, %rax<br>…            ..B2.2</pre> |


### **Unsupported Loop Structure**

- Loops where compiler does not know the iteration count:
  - Upper/lower bound of a loop are not loop-invariant
  - Loop stride is not constant
  - Early bail-out during iterations (e.g. break, exceptions, etc.)
  - Too complex loop body conditions for which no SIMD feature instruction exists
  - Loop dependent parameters are globally modifiable during iteration (language standards require load and test for each iteration)

- Transform is possible, e.g.:

```
struct _x { int d; int bound; };
void doit(int *a, struct _x *x)
{
  /* a[i] may alias x->bound, so the bound is reloaded on every iteration */
  for(int i = 0; i < x->bound; i++)
    a[i] = 0;
}
```

```
struct _x { int d; int bound; };
void doit(int *a, struct _x *x)
{
    /* bound copied to a local: the trip count is now loop-invariant */
    int local_ub = x->bound;
    for(int i = 0; i < local_ub; i++)
        a[i] = 0;
}
```


### **Non-Unit Stride Access**

- Non-consecutive memory locations are being accessed in the loop
- Vectorization works best with contiguous memory accesses
- Vectorization may still be possible for non-contiguous memory access, but...
  - Data arrangement operations might be too expensive (e.g. access pattern linear/regular)
  - Vectorization report issued when too expensive: Loop was not vectorized: vectorization possible but seems inefficient
- Examples:
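A few typical non-unit stride patterns, sketched with hypothetical names: reading one field from an array of structures, the same data stored contiguously (structure of arrays), and a stride-2 read.

```c
#include <stddef.h>

struct point { float x, y, z; };  /* array-of-structures element */

/* Non-unit stride: consecutive x values are sizeof(struct point) apart. */
float sum_x_aos(const struct point *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Unit stride: the x values are stored contiguously. */
float sum_x_soa(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Stride-2 read: only every other input element is used. */
void take_even(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[2 * i];
}
```

Changing the data layout from array of structures to structure of arrays is one way to turn the first pattern into the second.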

### Function Calls/In-lining I

- Function calls prevent vectorization in general
- Exceptions:
  - Call of intrinsic routines such as mathematical functions: Implementation is known to compiler
  - Successful in-lining of called routine: IPO enables in-lining of routines across source files

```
for (i = 1; i < nx; i++) {
  x = x0 + i * h;
  sumx = sumx + func(x, y, xp, yp);
}

// Defined in different compilation unit! (needs <math.h> for sqrt)
float func(float x, float y, float xp, float yp)
{
  float denom;
  denom = (x - xp) * (x - xp) + (y - yp) * (y - yp);
  denom = 1. / sqrt(denom);
  return denom;
}
```

### Function Calls/In-lining II

- Success of in-lining can be verified using the optimization report: Linux\*, OS X\*: -opt-report=<n> -opt-report-phase=ipo Windows\*: /Qopt-report:<n> /Qopt-report-phase:ipo
- Intel compilers offer a large set of switches, directives and language extensions to control in-lining globally or locally, e.g.:
  - #pragma [no]inline (C/C++), !DIR\$ [NO]INLINE (Fortran): Instructs compiler that all calls in the following statement can be in-lined or may never be in-lined
  - #pragma forceinline (C/C++), !DIR\$ FORCEINLINE (Fortran): Instructs compiler to ignore the heuristic for in-lining and to inline all calls in the following statement
  - See section "Inlining Options" in compiler manual for full list of options
- IPO offers additional advantages to vectorization
  - Inter-procedural alignment analysis
  - Improved (more precise) dependency analysis

### How to Succeed in Vectorization? II

- Non-unit stride between elements: Possible to change algorithm to allow linear/consecutive access?
- **Loop body too complex reports**: Try splitting up the loops!
- **Vectorization seems inefficient reports:** Enforce vectorization, benchmark and verify results!
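Splitting a loop can be sketched as follows (hypothetical function names and data). In the fused form, an element-wise computation shares the loop with a running sum that is carried from iteration to iteration; after splitting, the first loop is independent per element and therefore a straightforward vectorization candidate, while the running sum stays in its own loop:

```c
/* Fused: independent work and a carried running sum in one loop body. */
void fused(const float *in, float *sq, float *prefix, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        sq[i] = in[i] * in[i];  /* independent per element */
        acc += in[i];           /* carried across iterations */
        prefix[i] = acc;
    }
}

/* Split: same results, but the first loop has no carried dependence. */
void split(const float *in, float *sq, float *prefix, int n)
{
    for (int i = 0; i < n; i++)
        sq[i] = in[i] * in[i];

    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += in[i];
        prefix[i] = acc;
    }
}
```

Whether the split version is actually faster depends on the compiler and the data sizes, so it should be benchmarked like any other transformation.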






### Summary

- Intel<sup>®</sup> C++ Compiler and Intel<sup>®</sup> Fortran Compiler provide sophisticated and flexible support for vectorization
- They also provide a rich set of reporting features that help verifying vectorization and optimization in general
- Directives and compiler switches permit fine-tuning for vectorization
- Vectorization can even be enforced for certain cases where language standards are too restrictive
- Understanding of concepts like dependency and alignment is required to take advantage from SIMD features
- Intel<sup>®</sup> C++/Fortran Compiler can create multi-version code that addresses a broad range of processor generations, Intel and non-Intel processors alike, while exploiting each processor's individual feature set

### References

- Aart Bik: "The Software Vectorization Handbook" <u>http://www.intel.com/intelpress/sum\_vmmx.htm</u>
- Randy Allen, Ken Kennedy: "Optimizing Compilers for Modern Architectures: A Dependence-based Approach"
- Steven S. Muchnick: "Advanced Compiler Design and Implementation"
- Intel Software Forums, Knowledge Base, White Papers, Tools Support (see <u>http://software.intel.com</u>) Sample Articles:
  - http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intelc-compilers/
  - http://software.intel.com/en-us/articles/requirements-for-vectorizable-loops/
  - http://software.intel.com/en-us/articles/performance-tools-for-softwaredevelopers-intel-compiler-options-for-sse-generation-and-processor-specificoptimizations/



# **THANK YOU!**

### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

