Cori Application Readiness Strategy and Early Experiences





### March, 2016





## **Code Coverage**





## **Resources for Code Teams**

- Early access to hardware
  - Access to Babbage (KNC cluster) and early "white box" test systems expected in 2015
  - Early access and significant time on the full Cori system
- Technical deep dives
  - Access to on-site Cray and Intel staff for application optimization and performance analysis
  - Multi-day deep dive ("dungeon" session) with Intel staff at the Oregon campus to examine specific optimization issues
- User training sessions
  - From NERSC, Cray and Intel staff on OpenMP, vectorization and application profiling
  - Knights Landing architectural briefings from Intel
- NERSC staff as code team liaisons (hands-on assistance)
- 8 postdocs





## **NESAP** Postdocs





- Taylor Barnes (Quantum ESPRESSO)
- Mathieu Lobet
- Brian Friesen
- Tuomas Koskela (XGC1)
- Andrey Ovsyannikov (Chombo-Crunch)
- Tareq Malas (EMGeo)







## **NERSC Staff associated with NESAP**







- **Richard Gerber**
- Brian Austin
- Zhengji Zhao
- Helen He
- Ankit Bhagatwala
- Stephen Leak
- Katie Antypas
- Woo-Sun Yang
- Rebecca Hartman-Baker
- Doug Doerfler
- Jack Deslippe
- Brandon Cook
- Thorsten Kurth

**Target Application Team** Concept

- (1 FTE Postdoc +) 0.2 FTE AR Staff
- 0.25 FTE COE
- **1.0 FTE** User Dev.
- 1 Dungeon Session + 2 weeks on site with **chip vendor staff**





## Timeline












## Working With Vendors

NERSC is uniquely positioned between HPC vendors and HPC users and application developers.

NESAP provides a powerful venue for these two groups to interact.



### Dungeon Session Speedups (From Session

## What Has Gone Well

1. Setting requirements for the dungeon session motivates teams to get started early and improves the quality of the session.
2. Engagement with IXPUG and user communities (Exascale Workshops at CRT).
3. A large number of NERSC and vendor training sessions (vectorization, OpenMP, tools/compilers), all well received.
4. Learned a massive amount about tools and architecture (VTune, SDE, HBM, etc.).
5. Vendor staff are helpful to work with and very proactive.
6. Pipelining code work via Cray and Intel resources.





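| Application | All memory on far memory | All memory on near memory | Key arrays on near memory |
|-------------|--------------------------|---------------------------|---------------------------|
| BerkeleyGW  | baseline                 | 52% faster                | 52.4% faster              |
| EMGeo       | baseline                 | 40% faster                | 32% faster                |
| XGC1        | baseline                 |                           | 24% faster                |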

## What Has Gone Well (Cont)

7. Bandwidth-sensitive applications that live in HBM are expected to perform very well.

   The N9 workload analysis shows that a large fraction of jobs use < 16 GB of memory per node.

8. A lot of lessons learned: techniques to place key arrays in fast memory, improve prefetching effectiveness, cope without an L3 cache, etc.
9. CPU-intensive tasks (BGW GPP kernel) are expected to perform well on KNL (better than on Haswell).
10. Postdocs are deeply engaged.



### Version 1

- Simplify expressions to minimize #operations
- Use internal GAMMA function

### Version 2

- Remove "elemental" attribute, move loop inside.
- Inline subroutines. Divide, fuse, exchange loops.
- Replace assumed shape arrays with loops
- Replace division with multiplication by the reciprocal
- Remove initialization of arrays that are overwritten later
- Use more aggressive compiler flags
- Use profile-guided optimization (PGO)

### Version 3 (Intel compiler only)

- Use `!$OMP SIMD ALIGNED` to force vectorization







#### Original

```
real(8), dimension(5, (col_f_nvr-1)*(col_f_nvz-1), &
                   (col_f_nvr-1)*(col_f_nvz-1)) :: Ms
do index_ip = 1, mesh_Nzm1
  do index_jp = 1, mesh_Nrm1
    index_2dp = index_jp + mesh_Nrm1*(index_ip-1)
    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) * tmp_vol
    tmp_dfdr_v = dfdr(index_jp, index_ip) * tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) * tmp_vol
    tmpr(1:3) = tmpr(1:3) + &
       Ms(1:3, index_2dp, index_2D) * tmp_f_half_v
    tmpr(5) = tmpr(5) + &
       Ms(4, index_2dp, index_2D) * tmp_dfdr_v + ...
```

#### Optimized

```
real(8), dimension((col_f_nvr-1), 5, (col_f_nvz-1), &
                   (col_f_nvr-1)*(col_f_nvz-1)) :: Ms
do index_ip = 1, mesh_Nzm1
  do index_jp = 1, mesh_Nrm1
    index_2dp = index_jp + mesh_Nrm1*(index_ip-1)
    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) * tmp_vol
    tmp_dfdr_v = dfdr(index_jp, index_ip) * tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) * tmp_vol
    ! index_jp is now the fastest-varying index of both tmpr and Ms
    tmpr(index_jp,1) = tmpr(index_jp,1) + &
       Ms(index_jp,1,index_ip,index_2D) * tmp_f_half_v
    tmpr(index_jp,2) = tmpr(index_jp,2) + &
       Ms(index_jp,2,index_ip,index_2D) * tmp_f_half_v
    tmpr(index_jp,3) = tmpr(index_jp,3) + &
       Ms(index_jp,3,index_ip,index_2D) * tmp_f_half_v
    tmpr(index_jp,5) = tmpr(index_jp,5) + &
       Ms(index_jp,4,index_ip,index_2D) * tmp_dfdr_v + &
       ... * tmp_dfdz_v
```

**~40% speedup for the kernel**

Example From Cray COE Work on XGC1
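The effect of the `Ms` layout change can be illustrated with a small stride calculation. The sketch below is not part of the XGC1 code: the grid sizes are hypothetical, and it assumes `mesh_Nrm1` equals `col_f_nvr-1`, so that advancing `index_jp` by one advances `index_2dp` by one. It computes, for a column-major (Fortran) array, how many elements apart consecutive `index_jp` accesses to `Ms` are in each version.

```python
def column_major_strides(shape):
    """Element strides of a column-major (Fortran-order) array of this shape."""
    strides, step = [], 1
    for extent in shape:
        strides.append(step)
        step *= extent
    return strides

# Hypothetical sizes standing in for col_f_nvr-1 and col_f_nvz-1.
nvr_m1, nvz_m1 = 8, 6
n2d = nvr_m1 * nvz_m1

# Original:  Ms(k, index_2dp, index_2D)
original = column_major_strides((5, n2d, n2d))
# Optimized: Ms(index_jp, k, index_ip, index_2D)
optimized = column_major_strides((nvr_m1, 5, nvz_m1, n2d))

# Assumption: stepping index_jp by one steps index_2dp (dimension 2) by one.
stride_jp_original = original[1]    # elements skipped per index_jp step
stride_jp_optimized = optimized[0]  # index_jp is the first (fastest) dimension

print("original stride in index_jp: ", stride_jp_original)
print("optimized stride in index_jp:", stride_jp_optimized)
```

Under these assumptions the original layout strides by 5 elements per `index_jp` step, while the optimized layout is unit stride, which is the access pattern a vectorized inner loop over `index_jp` prefers.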





### XGC1 (NERSC Leads: Helen He, Ankit Bhagatwala)








### EMGeo (NERSC Leads: Scott French, Thorsten Kurth, Tareq Malas)

```
subroutine ell_spmv(mat, ind, x, z, m, n, ndiag)
  implicit none
  integer :: m, n, ndiag
  integer, dimension(ndiag, m) :: ind
  complex*16, dimension(n) :: x
  complex*16, dimension(m) :: z
  complex*16, dimension(ndiag, m) :: mat
  integer :: i, j
  complex*16 :: ztmp
  ! One matrix row per iteration; each thread accumulates into its own ztmp.
!$omp parallel do private(ztmp)
  do i = 1, m
    ztmp = (0.0d0, 0.0d0)
    do j = 1, ndiag
      ztmp = ztmp + mat(j,i) * x(ind(j,i))
    end do
    z(i) = ztmp
  end do
!$omp end parallel do
end subroutine ell_spmv
```

Vector loads when vectorized in i

```
!$omp parallel do private(ztmp)
  do i = 1, 2 * nx * ny
    ztmp = (0.0d0, 0.0d0)
    do j = 1, ndiag
      ztmp = ztmp + mat(i,j) * x(ind(i,j))
    end do
    z(i) = ztmp
  end do
!$omp parallel do private(ztmp)
  do i = 2 * nx * ny + 1, m - nx * ny
    ztmp = (0.0d0, 0.0d0)
    ! stride 1
    ztmp = ztmp + mat(i, 1) * x(i - 2)
    ztmp = ztmp + mat(i, 2) * x(i - 1)
    ztmp = ztmp + mat(i, 3) * x(i)
    ztmp = ztmp + mat(i, 4) * x(i + 1)
    ! stride nx
    ztmp = ztmp + mat(i, 5) * x(i - 2 * nx)
    ztmp = ztmp + mat(i, 6) * x(i - nx)
    ztmp = ztmp + mat(i, 7) * x(i)
    ztmp = ztmp + mat(i, 8) * x(i + nx)
    ! stride nx * ny
    ztmp = ztmp + mat(i, 9) * x(i - 2 * nx * ny)
    ztmp = ztmp + mat(i, 10) * x(i - nx * ny)
    ztmp = ztmp + mat(i, 11) * x(i)
    ztmp = ztmp + mat(i, 12) * x(i + nx * ny)
    z(i) = ztmp
  end do
```
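For readers less familiar with the ELL (ELLPACK) storage used by these kernels, the following minimal Python sketch shows the same row-wise product on a tiny made-up matrix. It is an illustration only, not EMGeo code: the helper `to_ell`, the test matrix and the padding convention (zero value, the row's own index as column) are assumptions, and indices are 0-based, unlike in the Fortran.

```python
def ell_spmv(mat, ind, x):
    """Return z with z[i] = sum over j of mat[i][j] * x[ind[i][j]]."""
    z = []
    for row_vals, row_cols in zip(mat, ind):
        ztmp = 0j
        for val, col in zip(row_vals, row_cols):
            ztmp += val * x[col]  # gather from x through the index array
        z.append(ztmp)
    return z

def to_ell(dense, ndiag):
    """Pack a dense square matrix into ELL form with ndiag entries per row."""
    mat, ind = [], []
    for i, row in enumerate(dense):
        entries = [(v, c) for c, v in enumerate(row) if v != 0]
        assert len(entries) <= ndiag, "row has more nonzeros than ndiag"
        # Padding: zero value with a valid column index (here the row index).
        entries += [(0j, i)] * (ndiag - len(entries))
        mat.append([v for v, _ in entries])
        ind.append([c for _, c in entries])
    return mat, ind

# Made-up 4x4 tridiagonal complex matrix.
dense = [
    [2 + 1j, -1, 0, 0],
    [-1, 2 + 1j, -1, 0],
    [0, -1, 2 + 1j, -1],
    [0, 0, -1, 2 + 1j],
]
x = [1, 2j, 3, 4 - 1j]

mat, ind = to_ell(dense, ndiag=3)
z = ell_spmv(mat, ind, x)

# Dense reference product for comparison.
reference = [sum(a * b for a, b in zip(row, x)) for row in dense]
```

The second Fortran loop above avoids the `ind` lookup entirely for interior rows, because the column offsets there are fixed multiples of 1, `nx` and `nx * ny`.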

### VASP (NERSC Lead: Zhengji Zhao)







Estimating the performance impact of HBW memory on the VASP code via the FASTMEM compiler directive and the memkind library on Edison



Test case: benchPdO2

VASP is a materials science code that consumes the most computing cycles at NERSC.

This test used a development version of the VASP code.

Adding the FASTMEM directives to the code was done by Martijn Marsman at Vienna University.

Block AMR framework.

Added "tiling" to improve data locality and OpenMP scaling on Xeon Phi.

Sample wall time measurements (sec):

| Iteration # | Outer-loop-level threading | Loop tiling |
|-------------|----------------------------|-------------|
| 1           | 3.76                       | 0.58        |
| 2           | 3.87                       | 0.69        |
| 3           | 3.86                       | 0.70        |
| 4           | 3.86                       | 0.69        |
| 5           | 3.84                       | 0.68        |
Now exploring in-transit analysis using specialized analysis ranks and the burst buffer.











### Quantum ESPRESSO (NERSC Lead Taylor Barnes / Jack Deslippe)

*How do you improve a code where most FLOPs occur in libraries?*

Targeting exact-exchange problems, which are characterized by many parallel FFTs.

Strategy:

Improve on-node performance by increasing the on-node FLOP density and reducing inter-node communication: move individual FFTs to a single-node shared-memory model and exploit new band-pair parallelism with MPI.
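One way to picture band-pair parallelism is as a static distribution of band pairs over MPI ranks, with each rank then running the FFTs for its own pairs on-node. The sketch below is purely illustrative and does not reproduce Quantum ESPRESSO's actual decomposition: the function `assign_band_pairs`, the use of unordered pairs and the round-robin assignment are all assumptions.

```python
from itertools import combinations_with_replacement

def assign_band_pairs(n_bands, n_ranks):
    """Round-robin assignment of unordered band pairs (i <= j) to ranks."""
    pairs = list(combinations_with_replacement(range(n_bands), 2))
    return [pairs[rank::n_ranks] for rank in range(n_ranks)]

# Hypothetical problem size: 6 bands shared among 4 ranks.
work = assign_band_pairs(n_bands=6, n_ranks=4)
for rank, pairs in enumerate(work):
    print(f"rank {rank}: {len(pairs)} pairs, starting with {pairs[0]}")
```

With a round-robin split the per-rank pair counts differ by at most one, so the FFT work is close to balanced when every pair costs about the same.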








Exploit parallelism not used by default in app.



