# NERSC Early KNL Experiences





### Yun (Helen) He NERSC/LBNL

NCAR Multi-core 6 Workshop Sept 13-14, 2016



## Introduction









### NERSC Exascale Science Application Program (NESAP)


- The NESAP program was launched in Fall 2014 to prepare the NERSC user community for the Cori KNL architecture.
- 20 applications were selected as Tier 1 (with postdocs) and Tier 2 applications to work closely with Cray, Intel and NERSC staff. An additional 26 teams are in Tier 3.
- Tier 1, 2 and 3 applications, together with proxy codes, represent 80% of NERSC hours.

#### NESAP TIER 1 AND 2 APPLICATIONS

| Application   | Science Area        | Algorithm  |
|---------------|---------------------|------------|
| Boxlib        | Multiple            | AMR        |
| Chombo Crunch | Multiple            | AMR        |
| CESM          | Climate             | Grid       |
| ACME          | Climate             | Grid       |
| MPAS-O        | Ocean               | Grid       |
| Gromacs       | Chemistry / Biology | MD         |
| Meraculous    | Genomics            | Assembly   |
| NWChem        | Chemistry           | PW DFT     |
| PARSEC        | Material Sci.       | RS DFT     |
| Quantum ESPRESSO | Material Sci.    | PW DFT     |
| BerkeleyGW    | Material Sci.       | MBPT       |
| EMGEO         | Geosciences         | Sparse LA  |
| XGC1          | Fusion              | PIC        |
| WARP          | Accelerators        | PIC        |
| M3D           | Fusion              | CD/PIC     |
| HACC          | Astrophysics        | N-Body     |
| MILC          | HEP                 | QCD        |
| Chroma        | Nuclear Physics     | QCD        |
| DWF           | HEP                 | QCD        |
| MFDN          | Nuclear Physics     | Sparse LA  |





### **NERSC KNL System: Cori Phase 2**



- Cori KNL: 9,304 nodes. Main features:
  - Many cores: 68 cores per node, 4 hardware threads per core.
    - About 3 times the cores (6 times the logical cores) per node compared with Edison, NERSC's Ivy Bridge system.
  - Larger vector units (supports AVX-512 instruction set)
    - Dual 512-bit SIMD units with FMA: 32 double precision flops/cycle
    - Edison (IvyBridge) has 256-bit AVX: 8 double precision flops/cycle
  - On package high bandwidth memory: MCDRAM
    - 450 GB/sec STREAM measurement as compared to 85 GB/sec from DDR4.
    - No direct L3 cache
  - Burst Buffer
- Cori Phase 1 and Phase 2 will be merged starting Sept 19, 2016; the merge will take about 6 weeks
- NESAP teams will get access first, in early Nov
- Gating procedure for general users
  - Need to show performance and scaling effort/results
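The flops/cycle figures above follow from simple arithmetic, sketched below in Python. The function names are illustrative, and modeling Ivy Bridge as one add unit plus one multiply unit is an assumption about how the slide's 8 flops/cycle is reached. The node figure uses the 68-core, 1.4 GHz part and is only an upper bound, since sustained AVX clocks may be lower.

```python
def dp_flops_per_cycle(simd_units: int, simd_bits: int, fma: bool) -> int:
    """Double-precision flops one core can retire per cycle."""
    lanes = simd_bits // 64            # 64-bit doubles per SIMD register
    flops_per_lane = 2 if fma else 1   # a fused multiply-add counts as 2 flops
    return simd_units * lanes * flops_per_lane


def node_peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Upper bound on node throughput: cores x clock x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle


# KNL: two 512-bit SIMD units with FMA (from the slide).
knl = dp_flops_per_cycle(simd_units=2, simd_bits=512, fma=True)
# Ivy Bridge: modeled as one 256-bit add unit plus one 256-bit multiply
# unit, no FMA (assumption used to reproduce the slide's figure).
ivb = dp_flops_per_cycle(simd_units=2, simd_bits=256, fma=False)

print(knl, ivb)  # 32 8
# 68 cores at the nominal 1.4 GHz clock.
print(f"{node_peak_gflops(68, 1.4, knl):.1f} GFLOP/s")
```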







- Carl (white boxes from Intel, single nodes only)
  - B0: 64 cores @1.3 GHz
  - B1: 68 cores @1.4 GHz
- Gerty (test system from Cray, with Aries network)
  - Similar to real Cori Phase1 & Phase2 system
  - P1: Haswell. Dual sockets, 16 cores/socket @ 2.3 GHz
  - P2: KNL. B1. 68 cores. @1.4 GHz

- All KNL nodes have
  - 4 hardware threads per core
  - 96 GB DDR4 and 16 GB MCDRAM





## **Intel Tools are Useful**











- Use -qopt-report=5 for detailed compiler reports on which optimizations were performed and why specific loops were or were not vectorized.





### **Intel VTune**



### Memory Access

- Detect memory hierarchy access issues (such as false sharing) and NUMA problems
- Measure DRAM and MCDRAM bandwidth
- Suggest data structures to allocate in MCDRAM
- General Exploration
  - Code efficiency
- Advanced Hot Spots
  - MPI/OpenMP load balance, potential gain

| Bandwidth Domain / Bandwidth Utilization Type / Memory Object / Allocation Stack | Memory Bound | Loads | Stores | LLC Miss Count | Average Latency (cycles) |
|---|---|---|---|---|---|
| DRAM, GB/sec | 0.657 | 125,874,377,622 | 16,061,040 | 130,507,830 | 40 |
| High | 0.750 | 28,236,084,708 | 5,014,875, | 75,304,518 | 91 |
| stream.c:180 (76 MB) | | 900,002,700 | 654,009,810 | 18,301,098 | 495 |
| stream.c:179 (76 MB) | | 1,050,003,150 | 667,210,008 | 33,301,998 | 487 |
| stream.c:181 (76 MB) | | 1,434,004,302 | 907,213,608 | 20,101,206 | 412 |
| Selected 1 row(s): | 1.000 | 126,000,378 | 21,600,324 | 300,018 | 61 |

#### OpenMP Analysis

- Collection Time: 28.061s
- Serial Time (outside any parallel region): 12.203s (43.5%)
  - Serial Time of your application is high. It directly impacts application Elapsed Time and scalability. Explore options for parallelization, algorithm or microarchitecture tuning of the serial part of the application.
- Parallel Region Time: 15.858s (56.5%)
  - Estimated Ideal Time: 5.005s (17.8%)
  - OpenMP Potential Gain: 10.853s (38.7%)
  - The time wasted on load imbalance or parallel work arrangement is significant and negatively impacts the application performance and scalability. Explore OpenMP regions with the highest metric values. Make sure the workload of the regions is enough and the loop schedule is optimal.

*Diagram from Intel*





### **Intel Advisor**



### Vectorization Advisor

- Sorts loops by potential performance gain
- Vectorization analysis
- Memory access pattern data
- Roofline Analysis

### Threading Advisor

- Threading design tool
- Suitability analysis with expected speedup

*[Screenshot: Intel Advisor XE 2016 Survey Report listing loops with self time, total time, trip counts, loop type, reasons for no vectorization, and vectorization efficiency. Callouts: filter by which loops are vectorized; trip counts; what prevents vectorization; focus on hot loops; what vectorization issues do I have; which vector instructions are being used; how efficient is the code. Diagram from Intel.]*



### **Intel Inspector**



- Detect memory and threading errors
  - Memory leaks
  - Data races, deadlocks, etc.
- Memory growth measurement

*[Screenshot: Intel Inspector report for xthi-race.impi. It flags a data race (severity: Error, state: New) in xthi-race.c, where OMP worker threads write concurrently to the unprotected statement global\_counter++.]*



# Guide and Understand Optimization with Roofline Model









## **The Roofline Model**



### **KNL Roofline Results**



- Using 2 threads/core

- Max L1, L2 and MCDRAM
  - 1 FLOP/iteration
  - 4 MPI + 32 threads
- Max GFLOP/s
  - 64 FLOPs/iteration
  - 2 MPI + 64 threads

|         | Quad Cache | Quad Flat | SNC2  | SNC4  | Peak (a)  |
|---------|------------|-----------|-------|-------|-----------|
| GFLOP/s | 2,205      | 2,199     | 2,224 | 2,212 | 2,253     |
| L1      | 5,894      | 6,040     | 5,889 | 6,055 | 9,011     |
| L2      | 1,834      | 1,827     | 1,829 | 1,840 | 2,252 (b) |
| MCDRAM  | 345        | 372       | 381   | 415   | 420 (c)   |
| DDR     |            | 77.0      | 76.9  | 76.9  | 102       |

- All bandwidths are in GB/s
- (a) Values assume an AVX frequency of 1.1 GHz
- (b) L2 assumed ~(L1 / 4)?
- (c) MCDRAM BW is for 1R/1W per iteration

*Slide from Doug Doerfler et al., IXPUG at ISC2016*
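As a rough illustration of how these ceilings are used, the roofline bound is the smaller of the compute peak and bandwidth times arithmetic intensity. The sketch below plugs in the measured Quad Flat values from the table; the function names and the sample intensities are hypothetical. As a side check, the 2,253 GFLOP/s peak is consistent with 64 cores x 32 flops/cycle x 1.1 GHz = 2,252.8, assuming a 64-core part was measured.

```python
def attainable_gflops(ai: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: compute peak or bandwidth x intensity, whichever is lower."""
    return min(peak_gflops, ai * bandwidth_gbs)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (flops/byte) above which a kernel is compute bound."""
    return peak_gflops / bandwidth_gbs


# Measured Quad Flat ceilings from the table (GFLOP/s and GB/s).
PEAK, MCDRAM_BW, DDR_BW = 2199.0, 372.0, 77.0

for ai in (0.1, 1.0, 10.0):  # hypothetical kernel intensities in flops/byte
    print(ai,
          attainable_gflops(ai, PEAK, MCDRAM_BW),
          attainable_gflops(ai, PEAK, DDR_BW))

print(f"MCDRAM ridge point: {ridge_point(PEAK, MCDRAM_BW):.2f} flops/byte")
```

A kernel whose intensity is below the ridge point is bandwidth bound at that memory level, so moving its data from DDR to MCDRAM raises its ceiling.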



## **How to Measure Arithmetic Intensity**



- http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/
- Intel SDE measures "Total Flops"
- Intel VTune measures "Total Bytes"
- Haswell consistently attains a higher arithmetic intensity than KNL
  - KNL generally moves more data to/from memory than Haswell due to lack of L3 cache
  - The higher theoretical performance benefits of MCDRAM bandwidth may not be fully realized due to this extra data movement
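A minimal sketch of the calculation, assuming the flop count comes from SDE and the byte count from VTune as described above. The counts below are hypothetical and only illustrate why extra data movement lowers the intensity measured on KNL.

```python
def arithmetic_intensity(total_flops: float, total_bytes: float) -> float:
    """Flops executed per byte moved to or from memory."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_flops / total_bytes


# Hypothetical counts for the same kernel on both architectures.
flops = 4.0e12            # "Total Flops" (e.g. from SDE)
haswell_bytes = 2.0e12    # "Total Bytes" (e.g. from VTune); L3 absorbs some traffic
knl_bytes = 3.2e12        # more traffic reaches memory without an L3

print(arithmetic_intensity(flops, haswell_bytes))  # 2.0
print(arithmetic_intensity(flops, knl_bytes))      # 1.25
```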





## **PICSAR Example**



#### **Optimizations**

- The original code spatially decomposes the problem with MPI.
- MPI subdomains are subdivided into a large number of tiles handled by OpenMP, which improves memory locality (and hence cache reuse within tiles) and load balance.
- The deposition and interpolation steps were rewritten to enable more efficient vectorization, and particle cell sorting was added to further improve memory locality and cache reuse.

#### **Observations**


- Tiling and vectorization increase the arithmetic intensity, which lets the code take advantage of the additional effective memory bandwidth.
- The code is not memory bound, so there is potential for further optimization.




Work by Mathieu Lobet et al.



### **MFDn Example**



- Optimizations
  - The use case requires all memory on the node (HBM + DDR). Explicitly place important arrays into MCDRAM with FASTMEM directives.
  - Block over multiple right-hand sides (nRHS) to improve bandwidth utilization and locality (the larger sparse matrix resides in DDR4)

#### Observations

- Code is highly memory bandwidth bound
- More RHS helps to increase arithmetic intensity. However, the number of RHS is limited by MCDRAM capacity.
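The trade-off in the last observation can be sketched with a simplified traffic model. This is not MFDn's actual data layout. It assumes a sparse matrix streamed once per block of right-hand sides at 12 bytes per nonzero (8-byte value plus 4-byte index), dense vectors read and written once each, and vectors held in 16 GB of MCDRAM. The problem size is hypothetical.

```python
MCDRAM_BYTES = 16 * 10**9  # 16 GB MCDRAM per node (decimal GB for simplicity)


def spmm_intensity(n_rows: int, nnz: int, n_rhs: int,
                   value_bytes: int = 8, index_bytes: int = 4) -> float:
    """Flops per byte for one sparse matrix times n_rhs dense vectors."""
    flops = 2 * nnz * n_rhs                           # multiply + add per nonzero per RHS
    matrix_bytes = nnz * (value_bytes + index_bytes)  # matrix streamed once per block
    vector_bytes = 2 * n_rows * n_rhs * value_bytes   # read inputs, write outputs
    return flops / (matrix_bytes + vector_bytes)


def max_rhs_in_mcdram(n_rows: int, value_bytes: int = 8,
                      capacity: int = MCDRAM_BYTES) -> int:
    """Largest block of RHS whose input and output vectors fit in MCDRAM."""
    return capacity // (2 * n_rows * value_bytes)


n_rows, nnz = 50_000_000, 5_000_000_000  # hypothetical problem size
for n_rhs in (1, 4, 16):
    print(n_rhs, round(spmm_intensity(n_rows, nnz, n_rhs), 3))
print("max RHS that fit:", max_rhs_in_mcdram(n_rows))
```

Under this model the intensity grows with the block size because the matrix traffic is amortized over more vectors, until the vectors no longer fit in MCDRAM.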





Work by Brandon Cook et al.

*Slide adapted from Doug Doerfler et al., IXPUG at ISC2016*



### **BerkeleyGW Example**

#### Optimization Steps

1. Refactor (3 loops for MPI, OpenMP, vectors)
2. Add OpenMP
3. Initial vectorization (loop reordering, conditional removal)
4. Cache blocking to better reuse the last-level cache
5. Improved vectorization
6. Add hyper-threading

#### Observations

- Arithmetic intensity drops from step 2 to step 3: the working set is larger than L2, and only Haswell has an L3 cache to catch it.
- From step 3 to step 4: no speedup on Haswell, since the working set already fits in L3, but a good improvement on KNL.
- There is potential for further optimization.




#### Sigma Optimization Process













### **Boxlib Example**

- Optimization
  - Loop tiling: Divide boxes into smaller tiles. Divide tiles among OpenMP threads
- Observations
  - Memory bandwidth bound
  - Effective L2 and L3 cache reuse on Haswell.
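The loop-tiling idea can be sketched as follows. This is a Python illustration of the decomposition only, not Boxlib's implementation. The tile shape (full length in the first dimension, 8 cells in the other two) and the round-robin assignment to threads are assumptions chosen for the example.

```python
from itertools import product


def tile_box(lo, hi, tile):
    """Split the box [lo, hi) into tiles of at most `tile` cells per dimension."""
    starts = [range(l, h, t) for l, h, t in zip(lo, hi, tile)]
    for start in product(*starts):
        end = tuple(min(s + t, h) for s, t, h in zip(start, tile, hi))
        yield start, end


def assign_round_robin(tiles, n_threads):
    """Distribute tiles over threads in round-robin order."""
    work = [[] for _ in range(n_threads)]
    for i, t in enumerate(tiles):
        work[i % n_threads].append(t)
    return work


# One hypothetical 64^3 box, tiled and shared among 16 threads.
tiles = list(tile_box((0, 0, 0), (64, 64, 64), (64, 8, 8)))
work = assign_round_robin(tiles, 16)
print(len(tiles), [len(w) for w in work])
```

Smaller tiles keep each thread's working set closer to the cache size, which is where the reuse noted above comes from.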













# **Overall NESAP Optimization Results**





### **Results from the NESAP teams**










**NESAP\*** Speedups



- Significant speedup usually requires code restructuring to improve vectorization and data locality.
- Speedups are mostly larger on KNL, because Haswell, with fewer and faster cores plus an L3 cache, is more forgiving of imperfect thread scaling and vectorization. (WARP, BerkeleyGW)
  - Boxlib is an exception: tiling benefits this memory-bandwidth-bound code more on Haswell, which has no HBM.





### Speedup on KNL vs Haswell



- EMGeo, MILC, and Chroma see large KNL vs. Haswell speedups: they are memory bandwidth bound, and the speedup comes from effective use of MCDRAM.





### **KNL/Haswell Memory Hierarchy Speedups**




- EMGeo, MILC, Chroma, MFDn are memory bandwidth bound. Speedup mostly matches MCDRAM vs. DDR bandwidth ratio
- KNL generally moves more data to/from memory than Haswell due to lack of L3 cache
- Codes that use the L3 cache effectively may perform better on Haswell than on KNL: Boxlib, XGC1





## **KNL AVX and FMA Speedups**






- BerkeleyGW sees a large speedup from AVX-512 vectorization compared with scalar instructions.
- Not many codes see a large effect from FMA. However, FMA use inside libraries is not measured here.





# MPI/OpenMP Process and Thread Affinity









### **Affinity Goal**



- Correct process and thread affinity for hybrid MPI/OpenMP programs is the basis for getting optimal performance on KNL. It is also essential for guiding further performance optimizations.
- Our goal is to promote OpenMP4 standard settings for portability.
   For example, OMP\_PROC\_BIND and OMP\_PLACES are preferred to Intel-specific KMP\_AFFINITY settings.
- In an Intel Dungeon session with CESM HOMME, we discovered that the OMP settings ran slower than the KMP settings.
  - What can be the cause?
  - Investigation started on this ...







Expect to see the same performance from all 7 cases on a 64-core KNL quad flat node:

- case 1: mpirun -n 32 -env OMP\_NUM\_THREADS 2 -env KMP\_AFFINITY compact,verbose -env KMP\_PLACE\_THREADS 1T numactl -m 1 ./app
- case 2: mpirun -n 32 -env KMP\_AFFINITY compact,verbose -env KMP\_PLACE\_THREADS 2C,1T numactl -m 1 ./app
- case 3: mpirun -n 32 -env OMP\_NUM\_THREADS 2 -env KMP\_AFFINITY scatter,verbose numactl -m 1 ./app
- case 4: mpirun -n 32 -env OMP\_NUM\_THREADS 2 -env OMP\_PROC\_BIND spread -env OMP\_PLACES threads numactl -m 1 ./app
- case 5: mpirun -n 32 -env OMP\_NUM\_THREADS 2 -env OMP\_PROC\_BIND close -env OMP\_PLACES cores numactl -m 1 ./app
- case 6: mpirun -n 32 -env KMP\_AFFINITY scatter,verbose -env KMP\_PLACE\_THREADS 2C,1T numactl -m 1 ./app
- case 7: mpirun -n 32 -env KMP\_AFFINITY scatter,verbose -env KMP\_PLACE\_THREADS 2C,1T -env OMP\_NUM\_THREADS 2 numactl -m 1 ./app





### **Affinity Analysis**



- Confirmed with another application that the expectation holds (same performance from all 7 cases).
- Confirmed with a simple affinity test case that the core bindings are all equivalent.
- However, initial HOMME runs gave different results for the 7 test cases. Some cases were >2X slower, and quad cache performance was about 5% slower.
- Further investigation showed that, although only 2 threads were requested, the code was running with 4. It was later discovered that nested OpenMP is enabled in the code by default!
- With the code modified to specify explicit num\_threads clauses for nested OpenMP regions, all 7 cases perform the same on quad flat and quad cache nodes.



### **HOMME Single-Node Scaling**



Good nested OpenMP scaling is achieved after the code bug is fixed.







## What About a 68-core KNL Node?



% mpirun -n 8 -env OMP\_PROC\_BIND spread -env OMP\_PLACES threads -env OMP\_NUM\_THREADS 4 ./xthi | sort -k4n,6n

- Hello from rank 0, thread 0, on ekm118. (core affinity = 0)
- Hello from rank 0, thread 1, on ekm118. (core affinity = 70)
- Hello from rank 0, thread 2, on ekm118. (core affinity = 72)
- Hello from rank 0, thread 3, on ekm118. (core affinity = 142)
- Hello from rank 1, thread 0, on ekm118. (core affinity = 144)
- Hello from rank 1, thread 1, on ekm118. (core affinity = 214)
- Hello from rank 1, thread 2, on ekm118. (core affinity = 216)
- Hello from rank 1, thread 3, on ekm118. (core affinity = 15)

Logical CPUs belonging to each physical core:

- core 0: 0, 68, 136, 204
- core 1: 1, 69, 137, 205
- core 2: 2, 70, 138, 206
- ...
- core 64: 64, 132, 200, 268
- ...
- core 67: 67, 135, 203, 271

Set I\_MPI\_PIN\_DOMAIN to the number of logical cores per MPI task. Otherwise, OpenMP threads cross tile boundaries. It is good to waste the extra 4 cores on purpose if 68 is not evenly divisible by the number of MPI tasks.
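The domain size can be derived with a small calculation, sketched here in Python. The helper name is hypothetical. Defaulting to 64 usable cores reflects the advice to leave 4 of the 68 cores idle, and 4 hardware threads per core is taken from the node description.

```python
def pin_domain(mpi_tasks: int, usable_cores: int = 64, hw_threads: int = 4) -> int:
    """Logical cores per MPI task, suitable as an I_MPI_PIN_DOMAIN value."""
    if mpi_tasks <= 0 or usable_cores % mpi_tasks != 0:
        raise ValueError("usable_cores must be evenly divisible by mpi_tasks")
    cores_per_task = usable_cores // mpi_tasks
    return cores_per_task * hw_threads


print(pin_domain(8))  # 32: 8 physical cores per task
print(pin_domain(2))  # 128: 32 physical cores per task
```

These values match the settings used elsewhere in this deck: 32 for the 8-task xthi run and 128 for the 2-task nested OpenMP example.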

% mpirun -n 8 -env OMP\_PROC\_BIND spread -env OMP\_PLACES threads -env OMP\_NUM\_THREADS 4 -env I\_MPI\_PIN\_DOMAIN 32 ./xthi | sort -k4n,6n

- Hello from rank 0, thread 0, on ekm118. (core affinity = 0)
- Hello from rank 0, thread 1, on ekm118. (core affinity = 2)
- Hello from rank 0, thread 2, on ekm118. (core affinity = 4)
- Hello from rank 0, thread 3, on ekm118. (core affinity = 6)
- Hello from rank 1, thread 0, on ekm118. (core affinity = 8)
- ...
- Hello from rank 7, thread 0, on ekm118. (core affinity = 56)
- Hello from rank 7, thread 1, on ekm118. (core affinity = 58)
- Hello from rank 7, thread 2, on ekm118. (core affinity = 60)
- Hello from rank 7, thread 3, on ekm118. (core affinity = 62)
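To read the "core affinity" numbers, it helps to map each logical CPU back to its physical core and tile. The sketch below assumes the numbering shown on this slide (core 0 owns logical CPUs 0, 68, 136, 204) and 2 cores per KNL tile. The function name is illustrative.

```python
def locate(cpu: int, cores: int = 68, cores_per_tile: int = 2):
    """Return (core, hw_thread, tile) for a logical CPU number."""
    hw_thread, core = divmod(cpu, cores)  # logical cpu = core + cores * hw_thread
    return core, hw_thread, core // cores_per_tile


# Affinities reported without I_MPI_PIN_DOMAIN (rank 0 threads 1 and 3, rank 1 thread 3).
for cpu in (70, 142, 15):
    print(cpu, locate(cpu))

# Affinities reported with I_MPI_PIN_DOMAIN 32 (rank 0): one tile per thread.
print([locate(cpu) for cpu in (0, 2, 4, 6)])
```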



- xthi.c and xthi-nested.c test codes available upon request
- Requested that the OpenMP standard provide an equivalent of Intel's KMP\_AFFINITY=verbose or Cray's CRAY\_OMP\_CHECK\_AFFINITY=TRUE



# **Nested OpenMP Thread Affinity**



- Again, I\_MPI\_PIN\_DOMAIN is important
- Sample settings for 2 MPI tasks, 4 outer OpenMP threads, and 4 inner OpenMP threads:
  - % export OMP\_NUM\_THREADS=4,4
  - % export OMP\_PROC\_BIND=spread,close
  - % export OMP\_PLACES=threads
  - % export OMP\_NESTED=true
  - % export I\_MPI\_PIN\_DOMAIN=128 # 32 physical cores
  - % export KMP\_HOT\_TEAMS=1
  - % export KMP\_HOT\_TEAMS\_MAX\_LEVELS=2
- Use num\_threads clause in source codes to set threads for nested regions.
   For most other non-nested regions, use OMP\_NUM\_THREADS environment variable for simplicity and flexibility.





# Choice of Default Cluster and Memory Modes









### **Available Modes for KNL Nodes**



- KNL has a configurable on-chip interconnect (cluster/NUMA mode) and a configurable memory mode.
- Sub-NUMA Cluster (SNC) modes
  - No SNC (Quad, All-to-All, Hemisphere), SNC-2, SNC-4
- Memory modes
  - Cache, Flat, Hybrid
- No-SNC and Cache modes are relatively easier to use
- Switching to another mode requires a node reboot, which takes about 11 to 26 minutes
- Analysis is ongoing to set the default mode(s) for NERSC (> 5,000 users, 800 projects)





### **General Strategies and Observations**



- If the application's memory footprint is <= 16 GB, it is best to use Flat mode and allocate everything in MCDRAM. Otherwise, manual data placement is needed.
- Cache mode gives a good starting point for most applications.
- Cache mode can be beaten by Flat mode plus manual data placement in MCDRAM.
- Performance in Cache mode can vary much more than in Flat mode.
  - Causes: different memory placement of allocations and different fragmentation left by previous jobs.
- SNC4/SNC2 provide small advantages over quadrant mode for some (not all) applications, but are relatively harder to use (for example, SNC4 has a different number of cores per NUMA domain).
- We have not yet tried Hybrid mode with MCDRAM.
- Default cluster and memory mode(s) on Cori have not been finalized.
  - Most likely quad flat (plus some quad cache and/or SNC4 flat).
  - Also not finalized: whether users may switch modes (most likely yes, to a certain extent, but node reboot time will be billed as run time).
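The first few strategies can be summarized as a small decision helper. This is only a sketch of the rules of thumb listed here, not a NERSC tool. The function name and the returned wording are hypothetical, and numactl -m 1 is the MCDRAM binding used in the quad flat affinity test cases earlier in this deck.

```python
MCDRAM_GB = 16  # MCDRAM capacity per KNL node


def suggest_memory_setup(footprint_gb: float, can_place_manually: bool) -> str:
    """Suggest a starting memory configuration for one node's share of a job."""
    if footprint_gb <= MCDRAM_GB:
        # Everything fits: bind all allocations to MCDRAM in Flat mode.
        return "flat: allocate everything in MCDRAM (e.g. numactl -m 1)"
    if can_place_manually:
        # Too large for MCDRAM: put only bandwidth-critical arrays there.
        return "flat: place bandwidth-critical arrays in MCDRAM, the rest in DDR"
    # No source changes possible: let MCDRAM act as a cache in front of DDR.
    return "cache: good starting point without code changes"


for gb, manual in ((12, False), (40, True), (40, False)):
    print(gb, manual, "->", suggest_memory_setup(gb, manual))
```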





### Summary



- Optimizing for KNL requires good thread-level scaling (OpenMP), vectorization, and effective use of HBM.
- Use the available Intel tools to help identify areas to optimize.
- Use the Roofline model to gauge optimization potential.
- Correct process and thread affinity is the basis for getting optimal performance.
- Keep portability in mind; use portable programming models.







### Thank you.



