### Introduction to the Roofline Model







#### **Charlene Yang** Lawrence Berkeley National Laboratory Jun 16 2019, Frankfurt





### **Performance Models**







#### The Maze of Performance Optimization

#### The Map !!!





### **Performance Models**



#### Modern architectures are complicated!







### **Performance Models**



- Many components contribute to the kernel run time
- An interplay of application characteristics and machine characteristics





#### **Roofline Performance Model**

Sustainable performance is bound by

$$\text{GFLOP/s} = \min \begin{cases} \text{Peak GFLOP/s} \\ \text{AI} \times \text{Peak GB/s} \end{cases}$$

- Arithmetic Intensity (AI) = **FLOPs / Bytes**
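As a minimal sketch, the bound can be written as a one-line Python function. The function name and the machine numbers in the usage lines are illustrative assumptions, not measurements from any specific system.

```python
def roofline_bound(ai, peak_gflops, peak_gbs):
    """Attainable GFLOP/s for a kernel with arithmetic intensity `ai` (FLOPs/byte)."""
    return min(peak_gflops, ai * peak_gbs)


# Hypothetical machine: 1000 GFLOP/s peak compute, 100 GB/s peak bandwidth.
PEAK_GFLOPS = 1000.0
PEAK_GBS = 100.0

# The "ridge point" is the AI at which the two bounds meet.
ridge_ai = PEAK_GFLOPS / PEAK_GBS

low_ai = roofline_bound(0.5, PEAK_GFLOPS, PEAK_GBS)    # bandwidth-bound
high_ai = roofline_bound(50.0, PEAK_GFLOPS, PEAK_GBS)  # compute-bound
print(ridge_ai, low_ai, high_ai)
```

Kernels with an AI below the ridge point sit under the sloped (bandwidth) part of the roof; kernels above it sit under the flat (compute) part.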

How did this come about?
 → A CPU DRAM example







- One could hope to always attain peak performance (FLOP/s)
- However, finite locality (reuse) and bandwidth limit performance.
- Assume:
  - Idealized processor/caches
  - Cold start (data in DRAM)

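$$\text{Time} = \max \begin{cases} \#\text{FLOPs} \,/\, \text{Peak GFLOP/s} \\ \#\text{Bytes} \,/\, \text{Peak GB/s} \end{cases}$$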
























Arithmetic Intensity (AI) = FLOPs / Bytes (as presented to DRAM)
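The step from a time bound to the Roofline formula can be written out as follows. This is a sketch of the standard derivation under the idealized assumptions listed above (compute and data movement overlap perfectly), with FLOPs and Bytes counted in units of $10^9$ so that they match the GFLOP/s and GB/s peaks.

$$
\begin{aligned}
\text{Time} &= \max\left\{ \frac{\#\text{FLOPs}}{\text{Peak GFLOP/s}},\; \frac{\#\text{Bytes}}{\text{Peak GB/s}} \right\} \\
\frac{\#\text{FLOPs}}{\text{Time}} &= \min\left\{ \text{Peak GFLOP/s},\; \frac{\#\text{FLOPs}}{\#\text{Bytes}} \times \text{Peak GB/s} \right\} \\
\text{GFLOP/s} &= \min\left\{ \text{Peak GFLOP/s},\; \text{AI} \times \text{Peak GB/s} \right\}
\end{aligned}
$$

The second line divides the FLOP count by the time bound, which turns the maximum of two times into the minimum of two rates.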









### **Roofline Performance Model**






- A throughput-oriented model
  - tracks rates not times, i.e. GFLOP/s, GB/s, not seconds
- An abstraction over
  - architectures, ISA (CPU, GPU, Haswell, KNL, Pascal, Volta)
  - programming models, programming languages
  - numerical algorithms, problem sizes
- Plotted on a log-log scale, which makes it easy to extrapolate performance along Moore's Law





# **More Advanced on Roofline**

















### **Roofline Performance Model**

- This is a single Roofline
- What about the memory hierarchy, different execution configurations, and instruction mixes?

→ Hierarchical Roofline
 → Multiple compute ceilings








### **Hierarchical Roofline**

- Superposition of multiple Rooflines
  - Incorporate full memory hierarchy
  - Arithmetic Intensity = FLOPs / Bytes<sub>L1/L2/HBM/SysMem</sub>

- Each kernel will have multiple AIs but one observed GFLOP/s performance
- Hierarchical Roofline tells you about cache locality
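A minimal sketch of how the superposed Rooflines bound a single kernel: each memory level contributes its own AI and its own bandwidth ceiling, and the lowest resulting bound wins. The function name is hypothetical and every number below is made up for illustration (these are not V100 specifications).

```python
def hierarchical_bound(flops, bytes_per_level, peak_gflops, peak_gbs_per_level):
    """Return (bound in GFLOP/s, name of the limiting ceiling)."""
    bound, limiter = peak_gflops, "compute"
    for level, nbytes in bytes_per_level.items():
        ai = flops / nbytes                      # one AI per memory level
        level_bound = ai * peak_gbs_per_level[level]
        if level_bound < bound:
            bound, limiter = level_bound, level
    return bound, limiter


# Made-up numbers for illustration only.
flops = 8.0e9
bytes_per_level = {"L1": 16.0e9, "L2": 8.0e9, "HBM": 2.0e9}
peak_gbs_per_level = {"L1": 10000.0, "L2": 2000.0, "HBM": 800.0}

bound, limiter = hierarchical_bound(flops, bytes_per_level, 7000.0, peak_gbs_per_level)
print(f"bound = {bound:.0f} GFLOP/s, limited by {limiter}")
```

In this made-up case the L2 ceiling is the tightest one, which would point at L2 traffic rather than HBM traffic as the first thing to reduce.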









Impact of execution configuration

- **Concurrency affects your peak**
  - **OpenMP** thread concurrency
  - load balance
  - threadblock/thread configuration
  - SM occupancy
- Performance is bound by the actual concurrency ceiling



### **Multiple Compute Ceilings**

- Impact of instruction mix
- Applications are usually a mix of FMA.f64, ADD.f64, MUL.f64...
- Performance is a weighted average, bound by a partial FMA ceiling
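One way to estimate such a partial FMA ceiling is sketched below. It assumes a simple machine model that is not stated in the slides: every floating-point instruction issues at the same rate, an FMA counts as 2 FLOPs, and an ADD or MUL counts as 1 FLOP. The function name and the peak value are illustrative.

```python
def partial_fma_ceiling(peak_fma_gflops, fma_fraction):
    """Compute ceiling for a mix where `fma_fraction` of FP instructions are FMAs.

    Assumed model: equal issue rate for all instructions; FMA = 2 FLOPs,
    ADD/MUL = 1 FLOP. The all-FMA peak is therefore 2 FLOPs per issue slot.
    """
    if not 0.0 <= fma_fraction <= 1.0:
        raise ValueError("fma_fraction must be between 0 and 1")
    flops_per_instruction = 2.0 * fma_fraction + 1.0 * (1.0 - fma_fraction)
    return peak_fma_gflops * flops_per_instruction / 2.0


peak = 7000.0  # hypothetical all-FMA peak in GFLOP/s
no_fma = partial_fma_ceiling(peak, 0.0)    # half of the FMA peak
half_fma = partial_fma_ceiling(peak, 0.5)  # three quarters of the FMA peak
all_fma = partial_fma_ceiling(peak, 1.0)   # the full FMA peak
print(no_fma, half_fma, all_fma)
```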







# **Roofline Drives Optimization**

















### **Roofline Performance Model**

#### The Roofline Model

- helps you identify the bottlenecks
- guides you through optimization
- tells you when to stop

#### An example:


• NESAP for Cori - BerkeleyGW

Haswell Roofline Optimization Path







#### **Optimization Path for Kernel-C (Sigma):**

1. Add OpenMP
2. Initial Vectorization
   - loop reordering
   - conditional removal
3. Cache-Blocking
4. Improved Vectorization
   - divides
5. Hyper-threading







#### Sigma Optimization Process



### **General Optimization Strategy**

 Broadly speaking, three approaches to improving performance:







- Maximize compute performance
  - multithreading
  - vectorization
  - increase SM occupancy
  - utilize FMA instructions
  - minimize thread divergence





- Maximize memory bandwidth
  - utilize higher-level caches
  - NUMA-aware allocation
  - avoid H-D transfers
  - avoid uncoalesced memory access

- Improve AI
  - minimize data movement
  - exploit cache reuse





# **Roofline Data Collection**

















### **Pen and Paper**

Example #1: STREAM Triad

    /* Z, X, Y are arrays of N doubles; alpha is a scalar */
    for (i = 0; i < N; i++) {
        Z[i] = X[i] + alpha * Y[i];
    }

- 2 FLOPs per iteration
- Transfer 24 bytes per iteration
  - read X[i], Y[i], and write Z[i]
- AI = 0.083 FLOPs per byte
- Memory bound
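The same pen-and-paper count can be reproduced in a few lines of Python. The sketch assumes 8-byte double-precision elements, which is what the 24 bytes per iteration above imply, and ignores any write-allocate traffic. The bandwidth figure is a hypothetical placeholder.

```python
BYTES_PER_DOUBLE = 8

flops_per_iter = 2                      # one multiply and one add
bytes_per_iter = 3 * BYTES_PER_DOUBLE   # read X[i], read Y[i], write Z[i]
triad_ai = flops_per_iter / bytes_per_iter

# Hypothetical DRAM bandwidth in GB/s; replace with a measured value (e.g. from ERT).
peak_gbs = 100.0
bound_gflops = triad_ai * peak_gbs      # bandwidth-bound ceiling for this kernel

print(f"AI = {triad_ai:.3f} FLOPs/byte, bound = {bound_gflops:.1f} GFLOP/s")
```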








Example #2: 7-pt stencil

- 7 FLOPs; 8 memory references (7 reads, 1 store) per point
- With ideal caching, only 1 read and 1 write per point (16 bytes) reach DRAM
- AI = 0.44 FLOPs per byte
- Memory bound, but 5x the GFLOP/s rate of Example #1
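A quick check of the stencil arithmetic, assuming 8-byte doubles: the sketch compares the case where every reference goes to DRAM with the case where the cache filters all but one read and one write per point.

```python
BYTES_PER_DOUBLE = 8
flops_per_point = 7

# No cache reuse: all 7 reads and 1 store go to DRAM.
ai_no_reuse = flops_per_point / (8 * BYTES_PER_DOUBLE)

# Ideal cache reuse: neighbours are served from cache, leaving 1 read + 1 write.
ai_ideal = flops_per_point / (2 * BYTES_PER_DOUBLE)

# STREAM Triad for comparison: 2 FLOPs per 24 bytes.
ai_triad = 2 / 24

print(f"no reuse: {ai_no_reuse:.2f}, ideal: {ai_ideal:.2f}, "
      f"vs. Triad: {ai_ideal / ai_triad:.2f}x")
```

Because both kernels are memory bound, the ratio of their AIs is also the ratio of their attainable GFLOP/s on the same machine.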





#### **Pen and Paper**


- Not scalable for real-life applications
- Millions of lines of code; mix of different languages
- Complicated modern architecture
  - memory hierarchy, caching effects
  - ISA
- Different problem sizes





### **We Need Tools!**



- Roofline ceilings
  - vendor specifications
  - empirical measurements
    - ERT: https://bitbucket.org/berkeleylab/cs-roofline-toolkit












#### **Require three raw measurements:**

- Runtime
- FLOPs
- Bytes (on each cache level)

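#### In order to calculate AI and GFLOP/s:

$$\text{AI} = \frac{\text{FLOPs}}{\text{Bytes}}, \qquad \text{GFLOP/s} = \frac{\text{FLOPs}}{\text{Runtime (s)} \times 10^{9}}$$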
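A minimal sketch of that calculation is shown below. The function name and the sample measurements are hypothetical; in practice the three inputs would come from a profiler.

```python
def roofline_point(flops, bytes_per_level, runtime_s):
    """Turn raw measurements into Roofline coordinates.

    Returns the observed GFLOP/s (y value) and one arithmetic
    intensity per memory level (x values).
    """
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    gflops = flops / runtime_s / 1e9
    ai = {level: flops / nbytes for level, nbytes in bytes_per_level.items()}
    return gflops, ai


# Made-up measurements for one kernel.
gflops, ai = roofline_point(
    flops=4.0e9,
    bytes_per_level={"L2": 8.0e9, "DRAM": 1.0e9},
    runtime_s=2.0,
)
print(gflops, ai)
```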





# **Methodology to Construct Roofline**



1. Collect Roofline ceilings
   - **compute** (FMA/no FMA) and **bandwidth** (DRAM, L2, ...)
   - ERT: https://bitbucket.org/berkeleylab/cs-roofline-toolkit
2. Collect application performance
   - FLOPs, bytes (DRAM, L2, ...), runtime
   - SDE, VTune, LIKWID, Advisor, nvprof, ...
3. Plot Roofline with Python Matplotlib (or other tools of your preference)
   - arithmetic intensity, GFLOP/s performance, ceilings
   - example scripts: https://github.com/cyanguwa/nersc-roofline
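The plotting step might look roughly like the sketch below. It is written independently of the example scripts linked above, and the function names, ceilings, and kernel point are all hypothetical. The curve is generated in plain Python; the plotting helper needs Matplotlib and is left uncalled here.

```python
def roofline_curve(peak_gflops, peak_gbs, ai_min=0.01, ai_max=100.0, points=50):
    """Return log-spaced AI values and the matching Roofline ceiling."""
    step = (ai_max / ai_min) ** (1.0 / (points - 1))
    ais = [ai_min * step**i for i in range(points)]
    ceiling = [min(peak_gflops, ai * peak_gbs) for ai in ais]
    return ais, ceiling


def plot_roofline(ais, ceiling, kernels, path="roofline.png"):
    """Plot the ceiling and kernel points on log-log axes (needs Matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.loglog(ais, ceiling, label="Roofline")
    for name, (ai, gflops) in kernels.items():
        ax.plot(ai, gflops, "o", label=name)
    ax.set_xlabel("Arithmetic Intensity (FLOPs/Byte)")
    ax.set_ylabel("Performance (GFLOP/s)")
    ax.legend()
    fig.savefig(path)


# Hypothetical ceilings and one hypothetical kernel measurement.
ais, ceiling = roofline_curve(peak_gflops=1000.0, peak_gbs=100.0)
kernels = {"kernel A": (0.5, 30.0)}
# plot_roofline(ais, ceiling, kernels)  # uncomment if Matplotlib is installed
```

For a hierarchical Roofline, one curve per memory level and one point per (kernel, level) pair can be drawn on the same axes.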



# **Automated Data Collection**



















#### The not-so-automated way 1:

- Intel SDE for FLOPs (emulation)
- Intel VTune for DRAM bytes (HW counters)
- Runtime
- DRAM Roofline only
- Used by NESAP (NERSC Exascale Science Application Program) for Cori
  - http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/














#### **DRAM Rooflines of NESAP Codes**



#### The not-so-automated way 2:

- LIKWID for FLOPs and bytes
  - Both are based on HW counters
- Runtime
- Hierarchical Roofline
- Limited by quality of HW counters
- High-level characterization, no callstack









#### The fully automated way:

- Intel Advisor, Roofline feature
- Instrument applications automatically
  - one dot per loop nest/function
- FLOPs, bytes and runtime
- Hierarchical Roofline


- Integrates with other Advisor capabilities
- Benchmarks target system






# **Data Collection on NVIDIA GPUs**



- Still very manual at this stage, but...
- Runtime:
  - Internal timers or `nvprof --print-gpu-trace`
- FLOPs:
  - DP/SP/HP counters and metrics, `nvprof --metrics flop_count_dp/sp/hp` or `tensor_precision_fu_utilization`
- Bytes for different cache levels:
  - Bytes = (read transactions + write transactions) x transaction size
  - `nvprof --metrics metric_name`, e.g. `gld_transactions`, `gst_transactions`
- Hierarchical Roofline
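The byte count formula above can be sketched as a small helper. The 32-byte transaction size is an assumption (it depends on the metric and the GPU architecture, so check the profiler documentation), and the counter values are made up rather than real nvprof output.

```python
def bytes_from_transactions(read_transactions, write_transactions, transaction_size=32):
    """Bytes = (read transactions + write transactions) x transaction size."""
    return (read_transactions + write_transactions) * transaction_size


# Made-up counter values for one kernel (not real profiler output).
flop_count_dp = 2_000_000
gld_transactions = 500_000
gst_transactions = 125_000

nbytes = bytes_from_transactions(gld_transactions, gst_transactions)
ai = flop_count_dp / nbytes
print(f"{nbytes} bytes, AI = {ai:.2f} FLOPs/byte")
```

Repeating the calculation with the transaction metrics of each cache level gives one AI per level, which is what the Hierarchical Roofline needs.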







- The Roofline Model formalizes the interaction between machine characteristics and application characteristics, and guides optimization
  - Peak computational throughput and bandwidth
  - Arithmetic intensity, cache locality, instruction mix...
- Automate Roofline data collection
  - Intel CPUs
    - Intel SDE + Intel VTune, Intel Advisor
  - NVIDIA GPUs
    - nvprof, Nsight Compute









- S. Williams, A. Waterman and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," *Communications of the ACM*, vol. 52, no. 4, pp. 65–76, 2009
- LBNL CRD Roofline Research: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline
- Empirical Roofline Toolkit (ERT): https://bitbucket.org/berkeleylab/cs-roofline-toolkit
- Python scripts for plotting manually-collected Roofline: https://github.com/cyanguwa/nersc-roofline/tree/master/Plotting







- This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
- This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
- This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.







#### **Thank You**



