

# Using NERSC for Research in High Energy Physics Theory

#### **Richard Gerber**

Senior Science Advisor

HPC Department Head (Acting)





Office of Science

#### NERSC: the Mission HPC Facility for DOE Office of Science Research





Office of Science

Largest funder of physical science research in U.S.



- Bio Energy, Environment
- Particle Physics, Astrophysics
- Computing
- Nuclear Physics
- Materials, Chemistry, Geophysics
- Fusion Energy, Plasma Physics








#### **Current Production Systems**





#### Edison

- 5,560 Ivy Bridge nodes, 24 cores/node
- 133 K cores, 64 GB memory/node
- Cray XC30, Aries Dragonfly interconnect
- 6 PB Lustre Cray Sonexion scratch FS

### **Cori Haswell Nodes**

- 1,900 Haswell nodes, 32 cores/node
- 52 K cores, 128 GB memory/node
- Cray XC40, Aries Dragonfly interconnect
- 24 PB Lustre Cray Sonexion scratch FS
- 1.5 PB Burst Buffer









### **Cori Xeon Phi KNL Nodes**




Cray XC40 system with 9,300 Intel Knights Landing compute nodes

68 cores / 96 GB DRAM / 16 GB HBM

Support the entire Office of Science research community

Begin to transition workload to energy efficient architectures

Data Intensive Science Support

- 10 Haswell processor cabinets (Phase 1)
- NVRAM Burst Buffer: 1.5 PB, 1.5 TB/sec
- 30 PB of disk, >700 GB/sec I/O bandwidth

Integrated with Cori Haswell nodes on the Aries network for data / simulation / analysis on one system







# **NERSC Allocation of Computing Time**





- **DOE Mission Science (80%)**: distributed by DOE SC program managers
- **ALCC (10%)**: competitive awards run by DOE ASCR
- **Director's Discretionary (10%)**: strategic awards from NERSC





#### NERSC has ~100% utilization



Important to get support and allocation from DOE program manager (L. Chatterjee) or through ALCC!

They are supportive.

| PI                | Allocation (Hrs) | Program        |
|-------------------|------------------|----------------|
| Childers/Lecompte | 18,600,000       | ALCC           |
| Hoeche            | 9,000,000        | DOE Production |
| Hinchliffe        | 800,000          | DOE Production |
| Ligeti            | 2,800,000        | DOE Production |
| Piperov           | 1,500,000        | DOE Production |



NERSC



#### **Initial Allocation Distribution Among Offices for 2016**








**BERKELEY LAB** 








## **Cori Integration Status**

- **July-August**: 9,300 KNL nodes arrive, installed, tested
- **Monday**: P1 shut down, P2 stress test
- **This week**
  - Move I/O, network blades
  - Add Haswell to P1 to fill holes
  - Cabling/re-cabling
  - Aries/LNET config
  - Cabinet reconfigs
- **Now to now+6 weeks**
  - ...continue, test, resolve issues
  - Configure SLURM
  - **NESAP code team access ASAP!**









# **Knights Landing Overview**



[Figure: KNL tile diagram showing two cores, each with 2 VPUs, sharing 1 MB of L2 and a CHA]

- Chip: 36 tiles interconnected by 2D mesh
- Tile: 2 cores + 2 VPU/core + 1 MB L2
- Memory:
  - MCDRAM: 16 GB on-package, high bandwidth
  - DDR4: 6 channels @ 2400, up to 384 GB
- IO: 36 lanes PCIe Gen3; 4 lanes of DMI for chipset
- Node: 1-socket only
- Fabric: Omni-Path on-package (not shown)
- Vector peak performance: 3+ TF DP and 6+ TF SP
- Scalar performance: ~3x over Knights Corner
- STREAM Triad (GB/s): MCDRAM 400+; DDR 90+

Source: Intel. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. KNL data are preliminary based on current expectations and are subject to change without notice. <sup>1</sup>Binary compatible with Intel Xeon processors using Haswell Instruction Set (except TSX). <sup>2</sup>Bandwidth numbers are based on STREAM-like memory access pattern when MCDRAM used as flat memory. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

### Key Intel Xeon Phi (KNL) Features



#### Single socket self-hosted processor

- (Relative!) ease of programming using portable programming models and languages (MPI+OpenMP)
- Evolutionary coding model on the path to manycore exascale systems
- Low-power manycore processor (68 cores) with up to 4 hardware threads per core
- 512b vector units
- Opportunity for 32 DP flops / clock / core (2 VPUs \* 8 DP lanes \* 2 flops per FMA)
- 16 GB High bandwidth on-package memory
- Bandwidth 4-5X that of DDR4 DRAM memory
- Many scientific applications are memory-bandwidth bound
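To make the bandwidth-bound point concrete, the sketch below applies a roofline-style bound: attainable performance is the lesser of the compute peak and memory bandwidth times arithmetic intensity. The roofline model is not part of these slides and is brought in only for illustration. The peak and STREAM figures are the preliminary Intel numbers quoted in the overview, and the two kernel intensities are hypothetical.

```python
# Roofline-style estimate: a kernel cannot run faster than the lesser of
# the compute peak and (memory bandwidth x arithmetic intensity).
# Hardware figures are the preliminary Intel numbers quoted in this deck.
PEAK_GFLOPS = 3000.0     # "3+ TF DP" vector peak per node
BW_MCDRAM_GBS = 400.0    # STREAM triad from MCDRAM ("400+")
BW_DDR_GBS = 90.0        # STREAM triad from DDR4 ("90+")


def attainable_gflops(flops_per_byte, bandwidth_gbs, peak_gflops=PEAK_GFLOPS):
    """Upper bound on GFLOP/s for a kernel with the given flops-per-byte ratio."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical kernels, chosen only to show the two regimes.
kernels = {
    "stream-like (0.1 flop/byte)": 0.1,
    "compute-heavy (50 flop/byte)": 50.0,
}

for name, intensity in kernels.items():
    ddr = attainable_gflops(intensity, BW_DDR_GBS)
    hbm = attainable_gflops(intensity, BW_MCDRAM_GBS)
    print(f"{name}: DDR {ddr:.0f} GFLOP/s, MCDRAM {hbm:.0f} GFLOP/s, "
          f"gain {hbm / ddr:.1f}x")
```

Under these assumptions the low-intensity kernel gains roughly the MCDRAM/DDR bandwidth ratio, while the compute-heavy kernel hits the same peak from either memory and gains nothing.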





#### **Top Level Parallelism**





#### **Opportunity cost: 9300X**





### **Thread-Level Parallelism for Xeon Phi Manycore**





#### Xeon Phi "Knights Landing"

68 Cores with 1-4 threads

Commonly using OpenMP to express threaded parallelism
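As a rough planning aid, the sketch below lists the ways one 68-core node could be split into MPI ranks and OpenMP threads per rank. It assumes every core is used and that ranks get equal numbers of whole cores; the function name `layouts` and the choice of 1, 2, or 4 hardware threads per core are illustrative, not NERSC recommendations.

```python
# Enumerate (MPI ranks per node, OpenMP threads per rank) layouts that
# give every rank the same number of whole cores on one KNL node.
CORES_PER_NODE = 68               # from the slide
HW_THREADS_PER_CORE = (1, 2, 4)   # illustrative picks from the "1-4 threads" range


def layouts(cores=CORES_PER_NODE, threads_per_core=1):
    """Return (ranks, omp_threads) pairs where ranks evenly divides cores."""
    result = []
    for ranks in range(1, cores + 1):
        if cores % ranks == 0:
            cores_per_rank = cores // ranks
            result.append((ranks, cores_per_rank * threads_per_core))
    return result


for tpc in HW_THREADS_PER_CORE:
    print(f"{tpc} thread(s)/core -> {CORES_PER_NODE * tpc} threads/node")
    for ranks, omp_threads in layouts(threads_per_core=tpc):
        print(f"  {ranks:2d} MPI ranks x {omp_threads:3d} OpenMP threads")
```

Because 68 has few divisors (1, 2, 4, 17, 34, 68), the set of balanced layouts is small, which is one reason hybrid MPI+OpenMP decompositions need some thought on this node.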







## **On-Chip Parallelism – Vectorization (SIMD)**



Single instruction to execute up to 16 DP floating point operations per cycle per VPU.

- 32 Flop / cycle / core
- 44 Gflops / core
- 3 TFlops / node
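The per-core and per-node peaks follow from simple multiplication, sketched below. The vector width, VPU count, and core count come from the slides; the 1.4 GHz clock is an assumption, chosen because it is consistent with the quoted 44 Gflops per core.

```python
# Reproduce the peak double-precision numbers quoted on this slide.
VPUS_PER_CORE = 2
DP_LANES_PER_VPU = 512 // 64   # 512-bit vectors of 64-bit doubles -> 8 lanes
FLOPS_PER_FMA = 2              # a fused multiply-add counts as 2 flops
CORES_PER_NODE = 68
CLOCK_GHZ = 1.4                # ASSUMPTION: clock rate is not stated on the slide

flops_per_cycle_per_vpu = DP_LANES_PER_VPU * FLOPS_PER_FMA
flops_per_cycle_per_core = VPUS_PER_CORE * flops_per_cycle_per_vpu
gflops_per_core = flops_per_cycle_per_core * CLOCK_GHZ
tflops_per_node = gflops_per_core * CORES_PER_NODE / 1000.0

print(f"{flops_per_cycle_per_vpu} DP flops / cycle / VPU")
print(f"{flops_per_cycle_per_core} DP flops / cycle / core")
print(f"{gflops_per_core:.1f} Gflops / core at {CLOCK_GHZ} GHz")
print(f"{tflops_per_node:.2f} Tflops / node")
```

These are upper bounds that require every cycle to issue two full-width FMA instructions per core; real kernels reach only a fraction of them.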

#### Scalar mode

(one instruction produces one result: `a + b`)

#### SIMD processing

(one instruction can produce multiple results: `c[i]` through `c[i+7]` from `a[i] + b[i]` through `a[i+7] + b[i+7]`)





#### **Knights Landing Integrated On-Package Memory**



- **Cache model**: let the hardware automatically manage the integrated on-package memory as an "L3" cache between the KNL CPU and external DDR
- **Flat model**: manually manage how your application uses the integrated on-package memory and external DDR for peak performance
- **Hybrid model**: harness the benefits of both cache and flat models by segmenting the integrated on-package memory



# Maximum performance through higher memory bandwidth and flexibility






#### Data layout crucial for performance

- Enables efficient vectorization
- Cache "blocking"
- Fit important data structures in 16 GB of MCDRAM
  - MCDRAM memory/core = 235 MB
  - DDR4 memory/core = 1.4 GB
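The per-core figures above are the node totals divided by 68 cores. The sketch below reproduces them and checks whether a working set fits in one core's share of MCDRAM. It assumes decimal units (1 GB = 1000 MB), which matches the 235 MB figure, and the array sizes in the example are hypothetical.

```python
# Per-core memory budget on a 68-core KNL node, and a simple fit check.
CORES_PER_NODE = 68
MCDRAM_GB = 16    # on-package high-bandwidth memory per node
DDR4_GB = 96      # DDR4 per node on Cori KNL


def per_core_mb(total_gb, cores=CORES_PER_NODE):
    """Even share of node memory per core, in MB (decimal units assumed)."""
    return total_gb * 1000 / cores


def fits_in_mcdram_share(doubles_per_array, n_arrays):
    """True if n_arrays arrays of 8-byte doubles fit in one core's MCDRAM share."""
    needed_mb = doubles_per_array * 8 * n_arrays / 1e6
    return needed_mb <= per_core_mb(MCDRAM_GB)


print(f"MCDRAM per core: {per_core_mb(MCDRAM_GB):.0f} MB")
print(f"DDR4 per core:   {per_core_mb(DDR4_GB) / 1000:.1f} GB")

# Hypothetical working sets: arrays of 10 million doubles per core.
print("2 arrays fit:", fits_in_mcdram_share(10_000_000, 2))   # 160 MB needed
print("3 arrays fit:", fits_in_mcdram_share(10_000_000, 3))   # 240 MB needed
```

A check like this is only a first pass: in flat mode the application decides what goes into MCDRAM, so the arrays that dominate memory traffic matter more than the total footprint.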

#### Knights Landing: Next-Generation Intel® Xeon Phi<sup>™</sup> Architectural Enhancements = ManyX Performance

- Binary-compatible with Intel® Xeon® processors
- Based on Intel® Atom™ core (Silvermont microarchitecture) with enhancements for HPC
  - 14nm process technology
  - 4 threads/core
  - Deep out-of-order buffers
  - Gather/scatter
  - Better branch prediction
  - Higher cache bandwidth
  - ...and many more
- 60+ cores, 2-D core mesh, cache coherency
- High performance: 3+ teraflops, 3x single-thread performance relative to the first-generation Intel® Xeon Phi™ coprocessor (Knights Corner)
- On-package memory: up to 16 GB at launch, over 5x STREAM bandwidth vs. DDR4
- DDR4 memory: capacity comparable to Intel® Xeon® processors
- NUMA support
- Integrated fabric
- Server processor
- In partnership with Micron

Source: Intel. Figures are preliminary, based on current expectations, and subject to change without notice.



Goal: Prepare DOE Office of Science users for manycore architectures

Partner closely with ~20 application teams and apply lessons learned to broad NERSC user community

#### NESAP activities include:



#### **Resources for Code Teams**



- Early access to hardware
  - Early "white box" test systems and testbeds
  - Early access and significant time on the full Cori system
- Technical deep dives
  - Access to Cray and Intel staff on-site for application optimization and performance analysis
  - Multi-day deep dive ('dungeon' session) with Intel staff at Oregon Campus to examine specific optimization issues
- User training sessions
  - From NERSC, Cray and Intel staff on OpenMP, vectorization, application profiling
  - Knights Landing architectural briefings from Intel
- NERSC staff as code team liaisons (hands-on assistance)
- 8 postdocs





#### **NERSC NESAP Staff**







- Katie Antypas
- Nick Wright
- Richard Gerber
- Brian Austin
- Stephen Leak
- Woo-Sun Yang
- Rebecca Hartman-Baker
- Doug Doerfler
- Jack Deslippe
- Brandon Cook
- Thorsten Kurth
- Brian Friesen







#### **NESAP** Postdocs





- Taylor Barnes (Quantum ESPRESSO)
- Zahra Ronaghi
- Andrey Ovsyannikov (Chombo-Crunch)
- Mathieu Lobet (WARP)
- Tuomas Koskela (XGC1)
- Tareq Malas (EMGeo)







#### **NESAP Code Status (Work in Progress)**

| Code             | GFLOP/s KNL | Speedup HBM / DDR | Speedup KNL / Haswell |
|------------------|-------------|-------------------|-----------------------|
| Chroma (QPhiX)   | 388 (SP)    | 4                 | 2.71                  |
| MILC             | 117.4       | 3.8               | 2.68                  |
| CESM (HOMME)     |             |                   | 1.8                   |
| MFDN (SPMM)      | 109.1       | 3.6               | 1.62                  |
| BGW Sigma        | 279         | 1.8               | 1.61                  |
| HACC             | 1200        |                   | 1.41                  |
| EMGEO (SPMV)     | 181.0       | 4.2               | 1.16                  |
| DWF              | 600 (SP)    |                   | 0.95                  |
| WARP             | 60.4        | 1.2               | 1.0                   |
| Meraculous       |             |                   | 0.75                  |
| Boxlib           |             | 1.13              | 1.1                   |
| Quantum ESPRESSO |             |                   | 1                     |
| XGC1 (Push-E)    | 8.2         | 0.82              | 0.2-0.5               |
| Chombo           |             |                   | 0.5-1.5               |

NESAP\* Code/Kernel Speedups



\*Speedups from direct/indirect NESAP efforts as well as coordinated activity in NESAP timeframe



- Setting requirements for Dungeon Session (Dungeon Session Worksheet)
- Engagement with IXPUG and user communities (DFT, Accelerator Design for Exascale Workshop at CRT)
- Learned a massive amount about tools and architecture
- Large number of NERSC and vendor training events (Vectorization, OpenMP, Tools/Compilers)
- Cray COE VERY helpful to work with. Very pro-active.
- Pipelining code work via Cray and Intel experts
- Case studies on the web to transfer knowledge to larger community







# EXTRA SLIDES

#### Why You Need Parallel Computing: The End of Moore's Law?



2X transistors/chip every 1.5 years, called "Moore's Law"

#### Microprocessors have become smaller, denser, and more powerful.



Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Slide source: Jack Dongarra






# **Power Density Limits Serial Performance**



Concurrent systems are more power efficient

- Dynamic power is proportional to V<sup>2</sup>fC
- Increasing frequency (f) also increases supply voltage (V) → cubic effect
- Increasing cores increases capacitance (C) but only linearly
- Save power by lowering clock speed
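The bullets above can be turned into a small relative-power model. This is a sketch under stated assumptions: supply voltage scales linearly with frequency, capacitance scales linearly with core count, and the workload parallelizes perfectly. The function names are illustrative.

```python
# Relative dynamic power and throughput for n cores at a scaled clock,
# using P ~ C * V^2 * f from the slide.

def relative_dynamic_power(n_cores, freq_scale):
    """Power relative to one core at full clock."""
    voltage_scale = freq_scale        # ASSUMPTION: V scales linearly with f
    capacitance_scale = n_cores       # C grows linearly with core count
    return capacitance_scale * voltage_scale ** 2 * freq_scale


def relative_throughput(n_cores, freq_scale):
    """Throughput relative to one core at full clock (perfect parallelism assumed)."""
    return n_cores * freq_scale


for n_cores, freq_scale in [(1, 1.0), (2, 0.5), (4, 0.5)]:
    power = relative_dynamic_power(n_cores, freq_scale)
    work = relative_throughput(n_cores, freq_scale)
    print(f"{n_cores} core(s) at {freq_scale:.1f}x clock: "
          f"power {power:.2f}x, throughput {work:.2f}x")
```

In this idealized model, two cores at half the clock match the throughput of one full-speed core at a quarter of the dynamic power, which is the argument for concurrent systems made on this slide.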

High performance serial processors waste power

- Speculation, dynamic dependence checking, etc. burn power
- Implicit parallelism discovery


More transistors, but not faster serial processors





### Processor design for performance and power





Exponential performance continues

Single-thread performance flat or decreasing

Power under control ( $P \sim f^{2-3}$ )

Number of cores / die grows





- Number of cores per chip will increase
- Clock speed will not increase (possibly decrease)
- Need to deal with systems with millions of concurrent threads
- Need to deal with intra-chip parallelism (OpenMP threads) as well as inter-chip parallelism (MPI)
- Any performance gains are going to be the result of increased parallelism, not faster processors
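For a sense of scale, the sketch below counts the concurrency available on the Cori KNL partition described earlier in this deck (9,300 nodes, 68 cores per node, up to 4 hardware threads per core, 2 VPUs per core). The 8 double-precision lanes per VPU are derived from the 512-bit vector width; the variable names are illustrative.

```python
# Count the levels of parallelism on the Cori KNL partition.
KNL_NODES = 9300
CORES_PER_NODE = 68
HW_THREADS_PER_CORE = 4
VPUS_PER_CORE = 2
DP_LANES_PER_VPU = 512 // 64   # 512-bit vectors of 64-bit doubles

total_cores = KNL_NODES * CORES_PER_NODE
total_hw_threads = total_cores * HW_THREADS_PER_CORE
total_dp_lanes = total_cores * VPUS_PER_CORE * DP_LANES_PER_VPU

print(f"cores:            {total_cores:,}")
print(f"hardware threads: {total_hw_threads:,}")
print(f"DP vector lanes:  {total_dp_lanes:,}")
```

A full-machine job therefore has to keep more than two million hardware threads busy, spread across MPI ranks between nodes and OpenMP threads plus vector lanes within each node.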





#### **Un-optimized Serial Processing = Left Behind**











# **Application Portability**



- DOE Office of Science will have at least two HPC architectures
  - NERSC and ALCF will deploy Cray-Intel Xeon Phi manycore-based systems in 2016 and 2018
  - OLCF will deploy an IBM Power/NVIDIA-based system in 2017
- Question: Are there best practices for achieving performance portability across architectures?
- What is "portability"?
  - Not `#ifdef`
  - Could be libraries, directives, languages, DSLs
  - Avoid vendor-specific constructs, directives, etc.?








# **Application Portability**

- Languages
  - Fortran?
  - Python?
  - C, C++?
  - UPC?
  - DSL?
  - Frameworks (Kokkos, RAJA, TiDA)



