# Performance Analysis using the Roofline Model

Samuel Williams (SWWilliams@lbl.gov), Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker

CRD/NERSC, Lawrence Berkeley National Laboratory

#### Introduction

- Roofline is a throughput-oriented performance model
- Tracks rates not times
- Independent of ISA and architecture
  - Applies to CPUs, GPUs, Google TPUs, FPGAs, etc.
- Defines <u>Good Performance</u>



- Arithmetic intensity (FLOPs per byte of data moved)
  - Can be very different from total loads/stores (bytes requested)
  - Equal to the ratio of sustained GFLOP/s to sustained GB/s (time cancels)
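The ratio above, and the bound it feeds into, can be sketched in a few lines. This is a minimal illustration, assuming the standard Roofline bound of min(peak GFLOP/s, intensity × peak GB/s); the function names and all machine and kernel numbers are hypothetical and are not measurements from this work.

```python
def arithmetic_intensity(gflops: float, gbytes_per_s: float) -> float:
    """FLOPs per byte: sustained GFLOP/s over sustained GB/s (time cancels)."""
    if gbytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return gflops / gbytes_per_s


def roofline_bound(ai: float, peak_gflops: float, peak_gbs: float) -> float:
    """Attainable GFLOP/s: the lower of the compute and bandwidth ceilings."""
    return min(peak_gflops, ai * peak_gbs)


# Hypothetical machine: 7000 GFLOP/s peak compute, 800 GB/s memory bandwidth.
PEAK_GFLOPS = 7000.0
PEAK_GBS = 800.0

# Hypothetical kernel sustaining 400 GFLOP/s while moving 800 GB/s.
ai = arithmetic_intensity(400.0, 800.0)
bound = roofline_bound(ai, PEAK_GFLOPS, PEAK_GBS)
print(f"AI = {ai:.2f} FLOPs/Byte, bound = {bound:.0f} GFLOP/s")
```

In this made-up case the kernel sits on the bandwidth ceiling, so only raising its intensity (moving fewer bytes per FLOP) could raise its bound.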



- Memory hierarchy on both CPUs and GPUs
- Different data movements for L2/HBM/PCIe imply different arithmetic intensities
- Differences in L2/HBM/PCIe intensity highlight differences in locality (similar AIs imply streaming)
- Focus on important loops, kernels, applications, ...
- Users can use Roofline to identify underperforming loops/kernels/apps
- Loops/kernels/apps attaining better than 50% of Roofline will see limited benefit from optimization
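A rough sketch of how per-level intensities and the 50% rule of thumb might be computed for a single kernel follows. The helper names, byte counts, and ceilings are hypothetical values chosen for illustration; only the 50% threshold comes from the text.

```python
def hierarchical_intensity(flops: float, bytes_by_level: dict) -> dict:
    """Arithmetic intensity (FLOPs/Byte) at each level of the memory hierarchy."""
    return {level: flops / nbytes for level, nbytes in bytes_by_level.items()}


def fraction_of_roofline(gflops: float, ai: float,
                         peak_gflops: float, peak_gbs: float) -> float:
    """Attained performance as a fraction of the Roofline bound."""
    return gflops / min(peak_gflops, ai * peak_gbs)


# Hypothetical kernel: 1e12 FLOPs, with bytes moved at each level.
bytes_moved = {"L2": 4.0e11, "HBM": 2.0e11, "PCIe": 1.0e10}
ai = hierarchical_intensity(1.0e12, bytes_moved)

# Hypothetical ceilings: 7000 GFLOP/s peak compute, 800 GB/s HBM bandwidth.
frac = fraction_of_roofline(1500.0, ai["HBM"], 7000.0, 800.0)
needs_attention = frac < 0.5  # below 50% of Roofline: worth optimizing

for level, value in ai.items():
    print(f"{level}: {value:.1f} FLOPs/Byte")
print(f"fraction of HBM Roofline: {frac:.3f}")
```

Here the L2 and HBM intensities differ by 2x, which would suggest some reuse out of L2; nearly equal values would instead suggest streaming.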



### Scaling Trajectories

- Performance as a function of thread concurrency provides little insight
- Need a better approach to understand turnovers in performance



Use Roofline to analyze thread scalability

#### "Roofline Scaling Trajectories"

- 2D scatter plot of performance as a function of intensity and concurrency
- Identify loss in performance due to increased cache pressure (data movement)
- NAS Parallel Benchmarks
- Intensity (data movement) varies with concurrency and problem size
- Large problems (green and ...) move more data per thread, and exhaust cache capacity



- Falling Intensity → hit the bandwidth ceiling quickly and degrade.
- Useful for understanding locality/BW contention induced scaling bottlenecks
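One way to build such a trajectory from raw measurements is sketched below: each (threads, FLOPs, bytes, seconds) sample becomes a point in (intensity, GFLOP/s) space, and a drop in intensity between successive thread counts is flagged. The sample values, the 5% tolerance, and the helper names are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Point(NamedTuple):
    threads: int
    ai: float      # arithmetic intensity, FLOPs/Byte
    gflops: float  # sustained GFLOP/s


def trajectory(samples: List[Tuple[int, float, float, float]]) -> List[Point]:
    """Turn (threads, flops, bytes, seconds) samples into Roofline points."""
    pts = [Point(t, f / b, f / s / 1e9) for t, f, b, s in samples]
    return sorted(pts, key=lambda p: p.threads)


def falling_intensity(points: List[Point], tol: float = 0.05) -> List[int]:
    """Thread counts where AI fell by more than `tol` (relative) compared
    with the previous, lower concurrency."""
    return [cur.threads for prev, cur in zip(points, points[1:])
            if cur.ai < prev.ai * (1.0 - tol)]


# Hypothetical data: FLOPs stay fixed, bytes grow once the cache overflows.
samples = [
    (1, 8.0e9, 4.0e9, 8.0),
    (2, 8.0e9, 4.0e9, 4.0),
    (4, 8.0e9, 8.0e9, 2.5),
    (8, 8.0e9, 16.0e9, 2.0),
]
pts = trajectory(samples)
drops = falling_intensity(pts)
print(pts)
print("intensity falls at thread counts:", drops)
```

In this invented data set, scaling is ideal from 1 to 2 threads, then intensity halves at 4 and again at 8 threads while the speedup flattens, which is the pattern the bullets above attribute to cache pressure.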

#### Roofline on GPUs

- Developed a Roofline methodology POC analyzing applications running on NVIDIA GPUs
- Use NVProf to collect Roofline-related metrics (FLOPs, cache/DRAM data movement, etc...)
- BerkeleyGW (Materials) https://github.com/cyanguwa/BerkeleyGW-GPP
- nw increases data reuse in inner loop
- More flops for fixed data movement
- Understand cache effects
- Quantify effects of FMA:MUL ratio (disable FMA in compiler)
- Observations...
  - High correlation with HBM BW
  - FMA doesn't hit FMA ceiling
  - High RF and L2 Locality
  - Minimal increases in L1 locality
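A sketch of how collected counters might be turned into per-level intensities is shown below. It assumes nvprof-style metric names (`flop_count_dp`, `dram_read_transactions`, `dram_write_transactions`, `l2_read_transactions`, `l2_write_transactions`) and 32 bytes per transaction; both are assumptions to verify against the profiler documentation for the target GPU, and the counter values are invented.

```python
BYTES_PER_TRANSACTION = 32  # assumption: each memory transaction moves 32 bytes


def intensities_from_metrics(m: dict) -> dict:
    """Compute L2 and HBM arithmetic intensity (FLOPs/Byte) from raw counters."""
    flops = m["flop_count_dp"]
    hbm_bytes = (m["dram_read_transactions"]
                 + m["dram_write_transactions"]) * BYTES_PER_TRANSACTION
    l2_bytes = (m["l2_read_transactions"]
                + m["l2_write_transactions"]) * BYTES_PER_TRANSACTION
    return {"L2": flops / l2_bytes, "HBM": flops / hbm_bytes}


# Hypothetical counter values for one kernel invocation.
metrics = {
    "flop_count_dp": 64_000_000_000,
    "dram_read_transactions": 500_000_000,
    "dram_write_transactions": 500_000_000,
    "l2_read_transactions": 3_000_000_000,
    "l2_write_transactions": 1_000_000_000,
}
ai = intensities_from_metrics(metrics)
print(ai)
```

Dividing the same FLOP count by the kernel's run time gives the GFLOP/s coordinate, so each kernel yields one point per memory level on the hierarchical Roofline.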



- **HPGMG (Multigrid)** https://bitbucket.org/hpgmg/hpgmg
- Multiple variants of GSRB smoother...
  - GSRB FP does 2x the work but is trivial to implement
  - STRIDE2 requires more complex memory access and predication
- Observations...
  - High correlation with HBM BW for large problem sizes (level>5)
  - Moderate L1 cache locality
  - Low reuse in the L2 cache for GSRB FP variant
  - STRIDE2 performance crashes due to decline in intensity
#### Roofline for TensorFlow

- Demonstrate methodology using conv2d in TensorFlow+cuDNN on V100 GPU
- Setup...
  - `input_image = tf.random_uniform(shape=input_size, minval=0., maxval=1., ...)`
  - `output_result = conv2d(input_image, 'NHWC', kernel_size, stride_size, dtype)`
- Forward Pass (2D conv)
- Backward Pass (2D conv + derivative)
  - `opt = tf.train.GradientDescentOptimizer(0.5)`
  - `exec_op = opt.compute_gradients(output_result)`
- Each kernel includes multiple sub-kernels
  - Padding, permutations, conversions, compute, etc.
  - Should include all of them when analyzing performance
- TensorFlow also includes an autotuning step
- Ignore autotuning when profiling/modeling
  - nvprof --profile-from-start off
  - Run 5 warmup iterations (autotuning / not profiled)
  - Start profiler (`pyc.driver.start_profiler`), run 20 iterations, stop profiler
- Vary parameters to understand performance
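The warmup-then-profile sequence can be sketched as a small harness. The function `profile_steady_state` and its parameters are hypothetical names introduced here; the stand-in callbacks only record events so the sketch runs without a GPU. On a real system the start/stop callbacks would presumably be the profiler controls named above and `step` would execute the TensorFlow op, but that wiring is an assumption and is not shown.

```python
from typing import Callable, List


def profile_steady_state(step: Callable[[], None],
                         start_profiler: Callable[[], None],
                         stop_profiler: Callable[[], None],
                         warmup: int = 5,
                         iters: int = 20) -> None:
    """Run `warmup` unprofiled iterations, then `iters` profiled ones."""
    for _ in range(warmup):
        step()  # autotuning happens here, outside the profiled region
    start_profiler()
    try:
        for _ in range(iters):
            step()
    finally:
        stop_profiler()  # always stop, even if a step raises


# Stand-ins (hypothetical): log the order of events instead of touching a GPU.
log: List[str] = []
profile_steady_state(
    step=lambda: log.append("step"),
    start_profiler=lambda: log.append("start"),
    stop_profiler=lambda: log.append("stop"),
)
print(log.count("step"), log.index("start"), log[-1])
```

Keeping the profiler controls as injected callables means the same loop can be dry-run on a machine without the profiler installed.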

#### conv2d Forward Pass



## Integration in Intel Advisor

- Roofline has been integrated into Intel's Advisor Performance Tool...
  - ✓ Automatically instruments applications (one dot per loop nest/function)
  - ✓ Computes FLOPS and AI for each function / loop nest
  - ✓ Integrated Cache Simulator (hierarchical roofline)
  - ✓ Automatically benchmarks target system (calculates ceilings)
  - ✓ AVX-512 support including vector masks
  - ✓ Full integration with existing Advisor capabilities
- Fully supported on NERSC's Edison and Cori (Haswell and Knights Landing) Systems
- http://www.nersc.gov/users/software/performance-and-debugging-tools/advisor/
  - `% module load advisor/2018.integrated_roofline`
  - `% cc -g -dynamic -openmp -O2 -o mycode.exe mycode.c`
  - `% source advixe-vars.sh`
  - `% advixe-cl -collect survey --project-dir ./your_project -- <your-executable-with-parameters>`
  - `% advixe-cl -collect tripcounts -enable-cache-simulation -flop --project-dir ./your_project -- <your-executable-with-parameters>`





- For codes that perform no floating-point operations, the traditional FLOP Roofline is irrelevant (no FLOPs)
- Advisor Roofline support expanded to include Integer and Integer+FLOP Rooflines



### **Community Engagement**

- Strong collaboration with NERSC, Intel, and NVIDIA
- We've run Roofline tutorials at SC'17, SC'18, SC'19, ECP'18, ECP'19, ISC'18, ISC'19, NERSC, etc.

#### **Publications**

- https://crd.lbl.gov/roofline/publications
- C. Yang, T. Kurth, S. Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System", CUG, 2019.
- C. Yang, S. Williams, "Performance Analysis of GPU-Accelerated Applications using the Roofline Model", GTC, 2019.
- C. Yang, et al., "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", P3HPC, 2018.
- K. Ibrahim, S. Williams, L. Oliker, "Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis", HPBench, 2018.
- T. Koskela, et al., "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", ISC, 2018.







