#### Performance Analysis with Roofline on GPUs

#### **ECP Annual Meeting 2019**



Charlene Yang, Application Performance Group, NERSC

Email: cjyang@lbl.gov





# Outline

- Use ERT to obtain empirical Roofline ceilings
  - compute: FMA, no-FMA
  - bandwidth: system memory, device memory, L2, L1
- Use nvprof to obtain application performance
  - FLOPs: active non-predicated threads, divides-aware
  - bytes: read + write; system memory, device memory, L2, L1
  - runtime: --print-gpu-summary, --print-gpu-trace
- Plot Roofline with Python and Matplotlib
- Examples and analysis
  - GPP from BerkeleyGW: varying AI, FMA, strided memory access
  - HPGMG from Multi-Grid applications: thread divergence

**One Hierarchical Roofline** 

Two Examples









## **Measure Roofline Ceilings**








### **Roofline Ceilings**

- Empirical Roofline Toolkit (ERT)
- <u>https://bitbucket.org/berkeleylab/cs-roofline-toolkit/</u>
- Characterizes machines with highly tuned but real 'micro-kernels'
- Sweeps through a variety of configurations:
  - 1 data element per thread -> multiple
  - 1 FLOP operation per data element -> multiple
  - number of threadblocks/threads
  - number of trials, dataset sizes, etc.
- Four components
  - Driver.c, Kernel.c, configuration script, and job script



[Figure: ERT bandwidth (GB/s) versus working set size (bytes)]

### **ERT Configuration**




| Component | Contents |
|-----------|----------|
| Kernel.c | loop over ntrials<br>distribute dataset on threads and each computes ERT_FLOPS |
| Kernel.h | ERT_FLOPS=1: a = b + c<br>ERT_FLOPS=2: a = a x b + c |
| Driver.c (uses some macros from config.txt) | initialize MPI, CUDA<br>loop over dataset sizes &lt;= ERT_MEMORY_MAX<br>loop over trial sizes &gt;= ERT_TRIALS_MIN<br>cudaMemcpy<br>start timer<br>call kernel<br>end timer |
| config.txt | ERT_FLOPS 1,2,4,8,16,32,64,128,256<br>ERT_GPU_BLOCKS 80,160,320,640,1280,2560<br>ERT_GPU_THREADS 64,128,256,512,1024<br>ERT_MEMORY_MAX 1073741824<br>ERT_WORKING_SET_MIN 128 |
| Job script | ./ert config.txt |
| ert (Python) | create directories |

1. Empirical Roofline Toolkit. https://bitbucket.org/berkeleylab/cs-roofline-toolkit/
2. Tutorial code. https://github.com/cyanguwa/nersc-roofline/
3. Roofline documentation. https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/

### **ERT Caveats**



- Read-modify-write polynomial on a vector
  - ERT\_FLOPS=1: a = b + c; ERT\_FLOPS=2: a = a x b + c; ...

- May require an unroll-and-jam or large OOO window to hit peak
  - #pragma unroll 8
- Uses 1:1 Read:Write ratio
  - ERT\_FLOPS=1: a = b + c
  - May underestimate aggregate cache bandwidth on architectures with 2:1 ratio
- Labels the largest/slowest bandwidth 'DRAM' and the smallest/fastest 'L1'
  - May label L2 as 'L1' on architectures with write-through





### **Peak Bandwidths**



- NVIDIA V100, Voltar at Oregon
- ERT\_FLOPS=1, GPU\_BLOCKS=640, GPU\_THREADS=256
- Bandwidth: HBM 828GB/s, L2 3TB/s
- GFLOP/s: 200GFLOP/s



- → These are the peak bandwidths!
- Still in a **bandwidth-bound** regime






### **Missing L1 Bandwidth**



- Unified cache size is 128 KB (L1 data + shared memory) per SM; L2 cache size is 6 MB

- Similar size: aggregated L1 size (up to 80 SMs x 128 KB = 10 MB) vs L2 (6 MB)
- Filling up L1 and L2 at the same time




### **Peak GFLOP/s**



- NVIDIA V100, Voltar at Oregon
- ERT\_FLOPS=1024, GPU\_BLOCKS=640, GPU\_THREADS=256
- Bandwidth: HBM 100GB/s
- GFLOP/s: 7TFLOP/s

- → ERT is now in a **compute-bound** regime
- → This is the peak GFLOP/s!
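The two ERT regimes can be checked with a back-of-the-envelope Roofline estimate. The sketch below assumes the ERT kernel streams 8-byte doubles with one read and one write per element (the 1:1 ratio noted under ERT Caveats), so its arithmetic intensity is ERT\_FLOPS / 16 FLOPs per byte. The function and constant names are illustrative and are not part of ERT.

```python
# Illustrative estimate only. Assumes FP64 data (8 bytes) and one read plus
# one write per element, i.e. 16 bytes moved per element.
BYTES_PER_ELEMENT = 2 * 8


def ert_arithmetic_intensity(ert_flops):
    """FLOPs per byte of the ERT polynomial kernel under the assumption above."""
    return ert_flops / BYTES_PER_ELEMENT


def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute ceiling, AI x bandwidth ceiling)."""
    return min(peak_gflops, ai * bandwidth_gbs)


PEAK_GFLOPS = 7000.0  # empirical peak quoted on the slides (7 TFLOP/s)
L2_GBS = 3000.0       # empirical L2 bandwidth quoted on the slides (3 TB/s)
HBM_GBS = 828.0       # empirical HBM bandwidth quoted on the slides

# ERT_FLOPS=1: AI = 1/16, far below the ridge point, so bandwidth-bound.
low = attainable_gflops(ert_arithmetic_intensity(1), PEAK_GFLOPS, L2_GBS)

# ERT_FLOPS=1024: AI = 64, so the compute ceiling is what limits the kernel.
high = attainable_gflops(ert_arithmetic_intensity(1024), PEAK_GFLOPS, HBM_GBS)

# Bandwidth actually consumed when running at the compute peak with AI = 64.
implied_bw = PEAK_GFLOPS / ert_arithmetic_intensity(1024)

print(low, high, implied_bw)
```

Under these assumptions the estimate gives about 190 GFLOP/s at ERT\_FLOPS=1 and about 110 GB/s of traffic at ERT\_FLOPS=1024, in line with the roughly 200 GFLOP/s and 100 GB/s reported on the slides.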





### **Empirical vs. Theoretical Ceilings**



- Theoretical compute ceilings on V100:
  - FMA: 80 SMs x 32 FP64 cores/SM x 2 FLOPs/FMA x 1.53 GHz = 7.83 TFLOP/s
  - No-FMA: 7.83 TFLOP/s /2 = 3.92 TFLOP/s
- Theoretical memory bandwidths on V100:
  - HBM: 900 GB/s
  - L2: 4.1 TB/s
  - L1: ~14 TB/s

http://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf
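The theoretical compute ceilings follow directly from the hardware figures listed above. The short sketch below reproduces that arithmetic and, as an added illustration, derives the ridge point (the arithmetic intensity at which a kernel stops being HBM-bound) for each ceiling. The variable names are illustrative.

```python
# Theoretical V100 FP64 ceilings, using the figures quoted on this slide.
SMS = 80
FP64_CORES_PER_SM = 32
CLOCK_GHZ = 1.53
HBM_GBS = 900.0

# Each FMA counts as 2 FLOPs; without FMA each core retires 1 FLOP per cycle.
fma_gflops = SMS * FP64_CORES_PER_SM * 2 * CLOCK_GHZ
no_fma_gflops = fma_gflops / 2

# Ridge point in FLOPs/byte: compute ceiling divided by bandwidth ceiling.
fma_ridge = fma_gflops / HBM_GBS
no_fma_ridge = no_fma_gflops / HBM_GBS

print(f"FMA:    {fma_gflops / 1000:.2f} TFLOP/s, ridge at {fma_ridge:.2f} FLOPs/byte")
print(f"No-FMA: {no_fma_gflops / 1000:.2f} TFLOP/s, ridge at {no_fma_ridge:.2f} FLOPs/byte")
```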







## **Measure Application Performance**





### **Application Performance**



- Three raw measurements: Runtime, FLOPs, Bytes (on a memory/cache level)

 $\text{Performance} = \frac{\text{nvprof FLOPs}}{\text{Runtime}}, \qquad \text{Arithmetic Intensity} = \frac{\text{nvprof FLOPs}}{\text{nvprof Data Movement}}$

- Runtime:
  - time per invocation of a kernel
    - nvprof --print-gpu-trace ./application
  - average time over multiple invocations
    - nvprof --print-gpu-summary ./application
  - invocations of the same kernel with different input parameters are grouped separately
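As a minimal sketch of the Performance formula, the snippet below averages a few per-invocation runtimes and converts a FLOP count into GFLOP/s. The runtimes and the FLOP count are hypothetical placeholders, not measurements from this talk; in practice they would come from the nvprof runs described above.

```python
from statistics import mean


def gflops_per_second(flop_count, runtime_seconds):
    """Performance = FLOPs / runtime, reported in GFLOP/s."""
    return flop_count / runtime_seconds / 1e9


# Hypothetical per-invocation runtimes in seconds (placeholder values).
runtimes = [1.0e-4, 1.2e-4, 1.1e-4]
# Hypothetical FLOP count for one invocation of the kernel.
flop_count = 30_000_000

avg_runtime = mean(runtimes)
perf = gflops_per_second(flop_count, avg_runtime)
print(f"average runtime {avg_runtime:.2e} s -> {perf:.1f} GFLOP/s")
```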





### **Application Performance**



- FLOPs:
  - predication-aware and divides-aware; dp/dp\_add/dp\_mul/dp\_fma, sp\*
  - nvprof --kernels 'kernel\_name' --metrics 'flop\_count\_xx' ./application
- Bytes for different memory/cache levels to construct hierarchical Roofline
  - nvprof --kernels 'kernel\_name' --metrics 'metric\_name' ./application
  - (read transactions + write transactions) x transaction size

| Memory Level  | Metrics                                             | Transaction Size |
|---------------|-----------------------------------------------------|------------------|
| L1            | gld_transactions, gst_transactions                  | 32B              |
| L2            | l2_read_transactions, l2_write_transactions         | 32B              |
| Device Memory | dram_read_transactions, dram_write_transactions     | 32B              |
| System Memory | sysmem_read_transactions, sysmem_write_transactions | 32B              |





### **Example Output**




- [cjyang@voltar source]\$ nvprof --kernels "1:7:smooth\_kernel:1" --metrics flop\_count\_dp --metrics gld\_transactions --metrics gst\_transactions --metrics l2\_read\_transactions --metrics l2\_write\_transactions --metrics dram\_read\_transactions --metrics dram\_write\_transactions --metrics sysmem\_read\_bytes --metrics sysmem\_write\_bytes ./backup-bin/hpgmg-fv-fp 5 8
- All metrics at once or one at a time: both are okay!
- Output in CSV; Python/Excel for multiple output files

Device "Tesla V100-PCIE-16GB (0)"

Kernel: void smooth\_kernel&lt;int=6, int=4, int=8&gt;(level\_type, int, int, double, double, int, double\*, double\*)

| Invocations | Metric Name             | Metric Description                          | Min      | Max      | Avg      |
|-------------|-------------------------|---------------------------------------------|----------|----------|----------|
| 1           | flop_count_dp           | Floating Point Operations(Double Precision) | 30277632 | 30277632 | 30277632 |
| 1           | gld_transactions        | Global Load Transactions                    | 4280320  | 4280320  | 4280320  |
| 1           | gst_transactions        | Global Store Transactions                   | 73728    | 73728    | 73728    |
| 1           | l2_read_transactions    | L2 Read Transactions                        | 890596   | 890596   | 890596   |
| 1           | l2_write_transactions   | L2 Write Transactions                       | 85927    | 85927    | 85927    |
| 1           | dram_read_transactions  | Device Memory Read Transactions             | 702911   | 702911   | 702911   |
| 1           | dram_write_transactions | Device Memory Write Transactions            | 151487   | 151487   | 151487   |
| 1           | sysmem_read_bytes       | System Memory Read Bytes                    | 0        | 0        | 0        |
| 1           | sysmem_write_bytes      | System Memory Write Bytes                   | 160      | 160      | 160      |
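To show how these counters turn into Roofline coordinates, the sketch below applies the "(read transactions + write transactions) x transaction size" rule to the metric values in the example output and computes the arithmetic intensity at each memory level. The dictionary layout and function names are illustrative choices and are not taken from the tutorial scripts.

```python
TRANSACTION_BYTES = 32  # transaction size for L1, L2 and device memory

# Metric values copied from the example nvprof output for smooth_kernel.
metrics = {
    "flop_count_dp": 30277632,
    "gld_transactions": 4280320,
    "gst_transactions": 73728,
    "l2_read_transactions": 890596,
    "l2_write_transactions": 85927,
    "dram_read_transactions": 702911,
    "dram_write_transactions": 151487,
}

# (read metric, write metric) for each level of the hierarchy.
LEVELS = {
    "L1": ("gld_transactions", "gst_transactions"),
    "L2": ("l2_read_transactions", "l2_write_transactions"),
    "HBM": ("dram_read_transactions", "dram_write_transactions"),
}


def level_bytes(m, level):
    """Bytes moved at one level: (reads + writes) x transaction size."""
    read_metric, write_metric = LEVELS[level]
    return (m[read_metric] + m[write_metric]) * TRANSACTION_BYTES


def arithmetic_intensity(m, level):
    """FLOPs per byte at the given memory level."""
    return m["flop_count_dp"] / level_bytes(m, level)


for level in LEVELS:
    print(f"{level:>3}: {level_bytes(metrics, level):>11d} bytes, "
          f"AI = {arithmetic_intensity(metrics, level):.3f} FLOPs/byte")
```

Dividing the same FLOP count by the kernel runtime gives the y-coordinate shared by all three points.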





## **Plot Roofline**





### **Plot Roofline**



- Runtime, FLOPs, Bytes $\rightarrow$ Arithmetic Intensity, application performance (GFLOP/s)

 $\text{Performance} = \frac{\text{nvprof FLOPs}}{\text{Runtime}}, \qquad \text{Arithmetic Intensity} = \frac{\text{nvprof FLOPs}}{\text{nvprof Data Movement}}$

- Python scripts using Matplotlib
- https://github.com/cyanguwa/nersc-roofline/tree/master/Plotting
- Simple example: plot\_roofline.py data.txt
- Tweaking needed for more sophisticated plotting, see examples





### **Plot Roofline**



- Simple example: plot\_roofline.py data.txt
- Roofline plot = Compute/Bandwidth ceilings + Two Coordinates per data point
- Accepts space-delimited list for values
- Use quotes to separate names/labels
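The idea of "ceilings plus two coordinates per data point" can be sketched in a few lines of Matplotlib. This is not the plot\_roofline.py script from the repository and does not read its data.txt format; it is a minimal stand-in with hypothetical function names, in which each bandwidth ceiling is drawn as min(peak, AI x bandwidth) on log-log axes.

```python
def roofline_curve(ai_values, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s at each AI: min(compute ceiling, AI x bandwidth)."""
    return [min(peak_gflops, ai * bandwidth_gbs) for ai in ai_values]


def plot_roofline(peak_gflops, bandwidths, points, filename="roofline.png"):
    """Draw one roof per bandwidth ceiling plus the measured kernel points.

    bandwidths: dict mapping a label to GB/s, e.g. {"HBM": 828.0}
    points:     dict mapping a label to (AI, GFLOP/s)
    """
    # Imported here so that roofline_curve stays usable without Matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    # Log-spaced AI axis from 0.01 to 1000 FLOPs/byte.
    ai_axis = [10 ** (e / 10) for e in range(-20, 31)]

    fig, ax = plt.subplots()
    for label, bw in bandwidths.items():
        ax.loglog(ai_axis, roofline_curve(ai_axis, peak_gflops, bw),
                  label=f"{label} {bw:g} GB/s")
    for label, (ai, gflops) in points.items():
        ax.plot(ai, gflops, "o", label=label)

    ax.set_xlabel("Arithmetic Intensity (FLOPs/Byte)")
    ax.set_ylabel("Performance (GFLOP/s)")
    ax.legend()
    fig.savefig(filename)
    return filename
```

A call such as `plot_roofline(7000.0, {"HBM": 828.0, "L2": 3000.0}, {"my kernel": (1.1, 500.0)})` would draw the empirical V100 ceilings from the earlier slides together with one made-up data point.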





## **Code Analysis**







- GPP (General Plasmon Pole) kernel from BerkeleyGW (Material Science)
- <u>https://github.com/cyanguwa/BerkeleyGW-GPP</u>
- Medium problem size: 512 2 32768 20
- Tensor-contraction, abundant parallelism, large reductions
- Low FMA counts, divides, complex double data type, HBM data 1.5GB

| Loop nest           | Mapping       |
|---------------------|---------------|
| do band = 1, nbands | #threadblocks |
| do igp = 1, ngpown  |               |
| do ig = 1, ncouls   | #threads      |
| do iw = 1, nw       | #unrolled     |
| compute; reductions |               |





Highly parameterizable:

- Varying nw from 1 to 6 to increase arithmetic intensity
  - increasing FLOPs, same HBM data movement
- Striding ig loop to analyze impact of strided memory access
  - Split the ig loop into two loops and place the 'blocking' loop outside

```
do band = 1, nbands  #threadblocks
  do igp = 1, ngpown
    do igs = 0, stride - 1 #threads
      do ig = 1, ncouls/stride
        do iw = 1, nw #unrolled
          compute; reductions
```





### **Analysis for GPP**

- Effects of varying AI, and FMA/no-FMA
- Appropriate counting of FLOPs for divides
- FLOPs on masked-out threads
- → nvprof has taken care of these!
- HBM Roofline (i.e. bytes are HBM bytes)
- AI increases as **nw** grows
- bandwidth bound → compute bound
- No-FMA converges to its ceiling
- But FMA doesn't
  - (-fmad=true/false)






### Analysis for GPP

- HBM Roofline (i.e. bytes are HBM bytes)
- Stride size doubles $\rightarrow$ AI halves
- compute bound $\rightarrow$ bandwidth bound
- Cache line 32B; each complex data element 16B
- AI should bottom out at Stride = 2
- But instead it bottoms out at Stride = 4
  - Prefetching may be in effect





### **Analysis for GPP**

- Hierarchical Roofline
- GPP is more HBM bound than L2/L1 bound at low nw's
- L1/L2 performance far from L1/L2 roof
- FLOPs $\propto$ nw
- HBM bytes: constant
- L2 bytes: increasing at $\alpha > 1$
- L1 bytes: constant
- Steep jump in L2 curve at nw=2, 3

[Figure: hierarchical Roofline for GPP on V100, one set of points per nw]





### **Analysis for GPP**



- Hierarchical Roofline
- At fixed **nw** (**nw**=6), striding leads to suboptimal memory coalescing
  - $\rightarrow$ L1 bytes double from stride 1 to stride 2, then stay constant
  - $\rightarrow$ stride 2 = 16B x 2 = 1 transaction
  - $\rightarrow$ L2/DRAM AI drops as well
- At Stride = 8, L1/L2/DRAM performance dots converge to HBM bandwidth
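The doubling of L1 bytes between stride 1 and stride 2 can be reproduced with a small counting model. The sketch below assumes a 32-thread warp in which thread t reads the 16-byte complex element at index t x stride from an aligned base address, and counts the distinct 32-byte transactions touched. It is a simplified illustration, not a model of the actual GPP kernel or of the full GPU memory system.

```python
WARP_SIZE = 32
ELEMENT_BYTES = 16      # one complex double
TRANSACTION_BYTES = 32  # transaction size used throughout the talk


def transactions_per_warp(stride):
    """Distinct 32-byte transactions touched when thread t reads element t*stride."""
    touched = {
        (t * stride * ELEMENT_BYTES) // TRANSACTION_BYTES
        for t in range(WARP_SIZE)
    }
    return len(touched)


for stride in (1, 2, 4, 8):
    n = transactions_per_warp(stride)
    useful = WARP_SIZE * ELEMENT_BYTES / (n * TRANSACTION_BYTES)
    print(f"stride {stride}: {n} transactions, {useful:.0%} of fetched bytes used")
```

In this model a unit-stride warp packs two elements into every transaction, while any stride of 2 or more needs one transaction per thread, which is why the L1 traffic doubles once and then stays flat.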







- HPGMG (High-performance Geometric Multigrid) from Adaptive Mesh Refinement codes
- <u>https://bitbucket.org/nsakharnykh/hpgmg-cuda</u>
- Stencil code, F-cycles and V-cycles, GSRB smoother (Gauss-Seidel Red-Black)






- Hybrid GPU and CPU code
- Example: hpgmg-fv 7 8
- 128^3 box x 8, Level 5-8 run on GPU, Level 1-4 on CPU
- Versions: GSRB\_FP, GSRB\_BRANCH














- GSRB\_BRANCH should have half the FLOPs of GSRB\_FP, but the same HBM/L1/L2 bytes



### **Analysis for HPGMG**

#### GSRB\_FP

- HBM AI increases as Level  $5 \rightarrow 8$
- Due to better surface-to-volume ratio
- Also more HBM bound
- L1 AI stays constant (roughly)
- FLOPs x 8 when Level +1
- L1 bytes x 8 when Level +1
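The surface-to-volume argument can be made concrete with a simple box model. The sketch below assumes a cubic box whose edge length doubles with each level and a one-cell-deep ghost layer around it; the edge lengths 16 to 128 and the ghost depth are illustrative assumptions, not values taken from the HPGMG source.

```python
def interior_cells(n):
    """Cells that are actually updated in an n x n x n box."""
    return n ** 3


def ghost_cells(n, depth=1):
    """Cells in a ghost layer of the given depth wrapped around the box."""
    return (n + 2 * depth) ** 3 - n ** 3


# Assumed edge lengths for four successive levels (doubling each time).
edges = [16, 32, 64, 128]
ratios = [ghost_cells(n) / interior_cells(n) for n in edges]

for n, r in zip(edges, ratios):
    print(f"{n:>3}^3 box: {interior_cells(n):>8d} cells, ghost/interior = {r:.3f}")
```

Each doubling of the edge multiplies the interior work by 8 while the relative ghost-layer overhead roughly halves, consistent with FLOPs growing 8x per level and HBM AI improving on the finer levels.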







### **Analysis for HPGMG**

#### **GSRB\_BRANCH**

- Half the FLOPs of GSRB\_FP; same bytes
- Thread predication/divergence
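A counting sketch shows why the two variants differ by a factor of two in FLOPs. It assumes that GSRB\_FP evaluates the stencil on every cell and discards the wrong-colour results with a 0/1 multiplier, while GSRB\_BRANCH guards the stencil with an if on the cell colour; this is an assumption about how the variants are structured, and FLOPS\_PER\_UPDATE is a hypothetical constant. Because nvprof counts FLOPs only on active, non-predicated threads, the branch variant registers work on half the cells.

```python
FLOPS_PER_UPDATE = 10  # hypothetical cost of one stencil update


def gsrb_fp_flops(n):
    """Mask variant: every cell in the n^3 box executes the stencil arithmetic."""
    return n ** 3 * FLOPS_PER_UPDATE


def gsrb_branch_flops(n, color):
    """Branch variant: only cells of the active colour execute the stencil."""
    flops = 0
    for k in range(n):
        for j in range(n):
            for i in range(n):
                if (i + j + k) % 2 == color:  # red-black checkerboard
                    flops += FLOPS_PER_UPDATE
    return flops


n = 8
print("GSRB_FP     FLOPs per sweep:", gsrb_fp_flops(n))
print("GSRB_BRANCH FLOPs per sweep:", gsrb_branch_flops(n, color=0))
```

Both variants still touch every cell's data, so under this model the byte counts match while the FLOP count, and therefore the arithmetic intensity, is halved for the branch variant.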









- Methodology to profile applications on **GPUs** with **Hierarchical Roofline** 
  - Use ERT to obtain empirical compute/bandwidth peaks
  - Use nvprof to collect FLOPs and Bytes on various memory levels
  - Handy Python scripts at <u>https://github.com/cyanguwa/nersc-roofline</u>
- Hierarchical Roofline is very helpful in understanding performance bounds (compute/bandwidth), analyzing the effects of memory coalescing and thread divergence, and guiding performance optimization efforts.









#### **Thank You!**



