

# Introduction to Performance Modeling

Charlene Yang, Thorsten Kurth

Application Performance Group, NERSC

cjyang@lbl.gov

#### Why Use Performance Models or Tools?

- Identify performance bottlenecks
- Motivate software optimizations
- Determine when we're done optimizing
  - Assess performance relative to machine capabilities
  - Motivate need for algorithmic changes
- Predict performance on future machines / architectures
  - Sets realistic expectations on performance for future procurements
  - Used for HW/SW Co-Design to ensure future architectures are well-suited for the computational needs of today's applications.



#### (DRAM) Roofline

- One could hope to always attain peak performance (Flop/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)







Note: Arithmetic Intensity (AI) = Flops / Bytes (as presented to DRAM)
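Under these assumptions the attainable performance can be written as a single bound. This is the standard Roofline expression; the symbols $P$, $P_{\text{peak}}$ and $B_{\text{DRAM}}$ are notation chosen here for illustration and do not appear on the slides.

$$
P \;\le\; \min\left(P_{\text{peak}},\; \mathrm{AI} \times B_{\text{DRAM}}\right),
\qquad
\mathrm{AI} = \frac{\text{Flops}}{\text{Bytes moved to/from DRAM}}
$$

Here $P_{\text{peak}}$ is the peak Flop/s of the machine and $B_{\text{DRAM}}$ its peak DRAM bandwidth in bytes per second.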





#### (DRAM) Roofline

- Plot Roofline bound using Arithmetic Intensity as the x-axis
- Log-log scale makes it easy to doodle, extrapolate performance along Moore's Law, etc...
- Kernels with AI less than machine balance are ultimately DRAM bound (we'll refine this later...)
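As a minimal sketch of the bound being plotted, the helpers below compute machine balance and the DRAM Roofline limit for a given AI. The function names and the units (GFlop/s, GB/s) are assumptions made for illustration, not part of any tool mentioned in this talk.

```c
/* Machine balance: the AI (flops per byte) at which the bandwidth
   ceiling meets the compute ceiling (the "ridge point"). */
double machine_balance(double peak_gflops, double peak_gbs) {
    return peak_gflops / peak_gbs;
}

/* DRAM Roofline bound in GFlop/s: min(peak, AI * bandwidth). */
double roofline_gflops(double ai, double peak_gflops, double peak_gbs) {
    double bandwidth_bound = ai * peak_gbs;
    return bandwidth_bound < peak_gflops ? bandwidth_bound : peak_gflops;
}

/* Nonzero when the kernel's AI lies left of the ridge point,
   i.e. the kernel is ultimately DRAM bound. */
int is_dram_bound(double ai, double peak_gflops, double peak_gbs) {
    return ai < machine_balance(peak_gflops, peak_gbs);
}
```

For a hypothetical machine with 1000 GFlop/s peak and 100 GB/s of DRAM bandwidth, the balance is 10 flops per byte, so a kernel with AI = 0.5 is limited to 50 GFlop/s.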





#### Roofline Example #1

- Typical machine balance is 5-10 flops per byte...
  - 40-80 flops per double to exploit compute capability
  - Artifact of technology and money
  - Unlikely to improve
- Consider STREAM Triad...

```
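// STREAM Triad: X, Y, Z are arrays of N doubles, alpha is a scalar
#pragma omp parallel for
for (i = 0; i < N; i++) {
    Z[i] = X[i] + alpha * Y[i];  // 1 multiply + 1 add = 2 flops
}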
```

- 2 flops per iteration
- Transfer 24 bytes per iteration (read X[i], Y[i], write Z[i])
- AI = 2 / 24 = 0.083 flops per byte == memory bound
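The counts above can be reproduced with a short sketch. `triad` and `triad_ai` are hypothetical names, and the byte count follows the slide's convention of three 8-byte transfers per iteration, which ignores any extra write-allocate traffic a real cache hierarchy might generate.

```c
/* STREAM Triad over n doubles: one multiply and one add per element. */
void triad(double *Z, const double *X, const double *Y, double alpha, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        Z[i] = X[i] + alpha * Y[i];
    }
}

/* Arithmetic intensity of Triad under the slide's assumptions:
   2 flops per iteration, read X[i] and Y[i], write Z[i]. */
double triad_ai(void) {
    const double flops_per_iter = 2.0;
    const double bytes_per_iter = 3.0 * sizeof(double);  /* 24 bytes */
    return flops_per_iter / bytes_per_iter;
}
```

With a machine balance of 5-10 flops per byte, an AI of about 0.083 sits far to the left of the ridge point.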





#### Roofline Example #2

- Conversely, 7-point constant coefficient stencil...
  - 7 flops
  - 8 memory references (7 reads, 1 store) per point
  - Cache can filter all but 1 read and 1 write per point
  - AI = 7 / 16 = 0.44 flops per byte == memory bound, but 5x the flop rate of STREAM Triad
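To make the effect of cache filtering concrete, the sketch below computes the stencil's AI from the number of 8-byte reads and writes per point that actually reach DRAM. The helper name `stencil_ai` is hypothetical, and the 7 flops per point is taken from the slide.

```c
/* AI of a 7-point constant-coefficient stencil on doubles, given how
   many 8-byte reads and writes per point reach DRAM. */
double stencil_ai(int dram_reads_per_point, int dram_writes_per_point) {
    const double flops_per_point = 7.0;
    double bytes_per_point =
        (dram_reads_per_point + dram_writes_per_point) * (double)sizeof(double);
    return flops_per_point / bytes_per_point;
}
```

With an ideal cache, `stencil_ai(1, 1)` gives 7/16 = 0.4375 flops per byte. If no reuse were captured, `stencil_ai(7, 1)` would give 7/64, or about 0.11.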





#### **Hierarchical Roofline**

- Real processors have multiple levels of memory
  - Registers
  - L1, L2, L3 cache
  - MCDRAM/HBM (KNL/GPU device memory)
  - DDR (main memory)
  - NVRAM (non-volatile memory)
- Applications can have locality in each level
  - Unique data movements imply unique AIs
  - Moreover, each level will have a unique bandwidth



#### **Hierarchical Roofline**

- Construct superposition of Rooflines...
  - Measure a bandwidth for each level of memory
  - Measure an AI for each level of memory
  - Although a loop nest may have multiple AIs and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum
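A minimal sketch of this superposition: each memory level contributes a bound of AI times bandwidth, and the kernel is limited by the smallest of those bounds and the compute peak. The `mem_level` struct and `hierarchical_roofline` function are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* One memory level: its measured bandwidth and the kernel's AI
   with respect to data moved at that level. */
typedef struct {
    const char *name;
    double gbs;  /* bandwidth in GB/s */
    double ai;   /* flops per byte moved at this level */
} mem_level;

/* Returns min(peak, min over levels of ai * gbs) in GFlop/s.
   If binding is non-NULL, it receives the index of the limiting
   level, or -1 when the compute peak is the limit. */
double hierarchical_roofline(double peak_gflops, const mem_level *levels,
                             size_t n, int *binding) {
    double bound = peak_gflops;
    int which = -1;
    for (size_t i = 0; i < n; i++) {
        double level_bound = levels[i].ai * levels[i].gbs;
        if (level_bound < bound) {
            bound = level_bound;
            which = (int)i;
        }
    }
    if (binding) {
        *binding = which;
    }
    return bound;
}
```

Reporting which level binds is a design choice made here because it points at the data movement worth reducing first.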





Broadly speaking, there are three approaches to improving performance:





- Maximize in-core performance (e.g. get compiler to vectorize)
- Maximize memory bandwidth (e.g. NUMA-aware allocation)
- Minimize data movement (increase AI)
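As one hedged illustration of the third approach, loop fusion removes a temporary array and so raises AI without changing the flop count. The kernel and function names below are hypothetical, and the byte counts assume arrays too large to stay in cache and no write-allocate traffic.

```c
/* Unfused: tmp is written to memory and then read back.
   Per element: 2 flops, 6 doubles moved (48 bytes), AI = 0.042. */
void add_then_scale_unfused(double *c, double *tmp, const double *a,
                            const double *b, const double *d, long n) {
    for (long i = 0; i < n; i++) {
        tmp[i] = a[i] + b[i];
    }
    for (long i = 0; i < n; i++) {
        c[i] = tmp[i] * d[i];
    }
}

/* Fused: the intermediate sum stays in a register.
   Per element: 2 flops, 4 doubles moved (32 bytes), AI = 0.0625. */
void add_then_scale_fused(double *c, const double *a, const double *b,
                          const double *d, long n) {
    for (long i = 0; i < n; i++) {
        c[i] = (a[i] + b[i]) * d[i];
    }
}
```

Both versions remain memory bound on typical machines, but the fused one moves a third less data for the same work.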





#### Initial Roofline Analysis of NESAP Codes





#### To construct a Roofline, we need tools...

- Use tools known/observed to work on NERSC's Cori (KNL, HSW)...
  - Used Intel SDE (Pin binary instrumentation + emulation) to create software Flop counters
  - Used Intel VTune performance tool (NERSC/Cray approved) to access uncore counters
- Accurate measurement of Flops (HSW) and DRAM data movement (HSW and KNL)
- Used by NESAP (NERSC KNL application readiness project) to characterize apps on Cori...
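The two measurements combine into an AI as sketched below. This assumes the flop count is taken from the SDE report and the DRAM traffic from uncore read and write transaction counts reported by VTune, with each transaction moving one 64-byte cache line. The function names, parameter names, and the 64-byte constant are assumptions for illustration; the NERSC page linked below describes the actual workflow.

```c
#include <stdint.h>

/* Assumption: one DRAM read or write transaction moves one 64-byte line. */
#define CACHE_LINE_BYTES 64

/* Total bytes moved to and from DRAM. */
double dram_bytes(uint64_t dram_reads, uint64_t dram_writes) {
    return (double)(dram_reads + dram_writes) * CACHE_LINE_BYTES;
}

/* Measured arithmetic intensity: counted flops / measured DRAM bytes. */
double measured_ai(double flops, uint64_t dram_reads, uint64_t dram_writes) {
    return flops / dram_bytes(dram_reads, dram_writes);
}
```

Dividing the same flop count by the measured run time gives the kernel's y-coordinate on the Roofline plot.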



http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/



#### **Evaluation of LIKWID**

- LIKWID provides easy to use wrappers for measuring performance counters...
  - ✓ Works on NERSC production systems
  - ✓ Minimal overhead (<1%)
  - ✓ Scalable in distributed memory (MPI-friendly)
  - ✓ Fast, high-level characterization
  - ✗ No detailed timing breakdown or optimization advice
  - ✗ Limited by quality of hardware performance counter implementation (garbage in/garbage out)
- Useful tool that complements other tools
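A minimal sketch of what the wrappers look like in code, assuming LIKWID's Marker API macros and the `likwid-marker.h` header of LIKWID 5 (older releases expose the same macros through `likwid.h`). The region name `"triad"` and the function names are hypothetical. When `LIKWID_PERFMON` is not defined, the macros are stubbed out so the code builds without LIKWID installed.

```c
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
/* Without LIKWID, the markers compile away to nothing. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Triad with the loop enclosed in a named measurement region. */
void triad_instrumented(double *Z, const double *X, const double *Y,
                        double alpha, long n) {
    LIKWID_MARKER_START("triad");
    for (long i = 0; i < n; i++) {
        Z[i] = X[i] + alpha * Y[i];
    }
    LIKWID_MARKER_STOP("triad");
}

/* Driver: initialize the Marker API once, run the region, then close. */
void profile_triad(double *Z, const double *X, const double *Y,
                   double alpha, long n) {
    LIKWID_MARKER_INIT;
    triad_instrumented(Z, X, Y, alpha, n);
    LIKWID_MARKER_CLOSE;
}
```

Building with `-DLIKWID_PERFMON` and linking against the LIKWID library enables the markers; the counters for the region are then collected by running the binary under `likwid-perfctr` in marker mode.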





#### **Intel Advisor**

#### Includes Roofline Automation...

- Automatically instruments applications (one dot per loop nest/function)
- ✓ Computes FLOPS and AI for each function (CARM)
- ✓ AVX-512 support that incorporates masks
- ✓ Integrated Cache Simulator (hierarchical roofline / multiple AIs)
- Automatically benchmarks target system (calculates ceilings)
- Full integration with existing Advisor capabilities



http://www.nersc.gov/users/training/events/roofline-training-1182017-1192017







## Thank You