# Lessons Learned from Selected NESAP Applications







#### **Helen He**

NCAR Multi-core 5 Workshop Sept 16-17, 2015



# The Big Picture

- The next large NERSC production system, "Cori", will be based on the Intel Xeon Phi KNL (Knights Landing) architecture
  - Self-hosted (not an accelerator), with 72 cores per node and 4 hardware threads per core
  - Larger vector units (512 bits)
  - On-package high-bandwidth memory (HBM)
  - Burst Buffer
- To achieve high performance, applications need to exploit more on-node parallelism through thread scaling and vectorization, and to make use of the HBM and Burst Buffer options.
- Hybrid MPI/OpenMP is a recommended programming model for achieving both scalability and code portability.







#### NERSC Exascale Science Application Program (NESAP)



- Goal: to prepare the DOE Office of Science user community for the Cori manycore architecture
- 20 applications were selected as Tier 1 (with postdocs) and Tier 2 applications to work closely with Cray, Intel, and NERSC staff, plus an additional 26 Tier 3 teams. Lessons learned are shared with the broader user community.
- Available resources are:
  - Access to vendor resources and staff including "dungeon sessions" with Intel and Cray Center of Excellence
  - Early access to KNL "whitebox" systems
  - Early access and time on Cori
  - Trainings, workshops, and hackathons
  - Intel Xeon Phi User Group (IXPUG)





#### **NESAP Code Coverage**





#### **Lessons Learned from Selected Applications**



Presentation materials were contributed by the NERSC Application Readiness Team (NERSC staff) and the NESAP teams (application developers, NERSC liaisons, Cray Center of Excellence staff, and Intel staff).

| Application | Science Area      | PI                         | NERSC Liaison |
|-------------|-------------------|----------------------------|---------------|
| BerkeleyGW  | Materials Science | Jack Deslippe              | Jack Deslippe |
| CESM        | Climate           | John Dennis                | Helen He      |
| EmGeo       | Earth Science     | Gregory Newman             | Scott French  |
| NWChem      | Chemistry         | Wibe De Jong, Eric Bylaska | Zhengji Zhao  |
| XGC1        | Fusion            | Choong-Seock Chang         | Helen He      |





## **Recommended Optimization Path**




# **Kernel Optimizations Examples**









#### **BerkeleyGW Optimization Steps**

- Target more on-node parallelism. (MPI model already failing users)
- Ensure key loops/kernels can be vectorized.






## **EmGeo: 7 SpMV Kernel Variants**

- Span the space of likely optimizations to assess their performance impact on non-KNL architectures
  - Alignment tweaks; loop reordering and unrolling; memory layout optimizations; Fortran "SIMD-ization"
- Ready for profiling once KNL access is available
- Best variant: only ~8% speedup over the original code
  - Only certain variants show a vectorization speedup on Haswell (HSW)
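
For readers unfamiliar with this kernel class, the following is a minimal sketch of a sparse matrix-vector product (SpMV). It assumes generic CSR storage and a made-up 4x4 matrix; it is not EmGeo's kernel or storage format, and the names `row_ptr`, `col_idx` and `val` are illustrative only.

```fortran
! Generic CSR SpMV sketch (y = A*x); not EmGeo code.
program spmv_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nrows = 4, nnz = 7
  ! CSR storage of a hypothetical 4x4 matrix:
  !   [ 4 1 0 0 ]
  !   [ 0 3 0 0 ]
  !   [ 0 0 2 1 ]
  !   [ 1 0 0 5 ]
  integer  :: row_ptr(nrows+1) = [1, 3, 4, 6, 8]
  integer  :: col_idx(nnz)     = [1, 2, 2, 3, 4, 1, 4]
  real(dp) :: val(nnz) = [4.0_dp, 1.0_dp, 3.0_dp, 2.0_dp, 1.0_dp, 1.0_dp, 5.0_dp]
  real(dp) :: x(nrows), y(nrows), acc
  integer  :: i, jj

  x = 1.0_dp

  ! Rows are independent, so they are shared among threads;
  ! the short inner loop over one row's nonzeros is the SIMD candidate.
  !$omp parallel do private(jj, acc)
  do i = 1, nrows
    acc = 0.0_dp
    !$omp simd reduction(+:acc)
    do jj = row_ptr(i), row_ptr(i+1) - 1
      acc = acc + val(jj) * x(col_idx(jj))   ! indirect (gather) access to x
    end do
    y(i) = acc
  end do
  !$omp end parallel do

  print *, y
end program spmv_demo
```

The indirect access to `x` is one reason vectorization gains for SpMV tend to be modest, which is consistent with the small speedup reported above.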









## What does the code look like?





\*\* Omitting alignment-related directives, etc.

show largest instruction count drop from AVX2 to AVX512.



# Improve OpenMP Scaling Examples









# XGC1: Remove "-heap-arrays 64" Compiler Flag

- This Intel compiler flag puts automatic arrays and temporary arrays of 64 KB or larger on the heap instead of the stack.
- Surprisingly, it slows down both the collision and pushe kernels by more than 6X.
- Allocating and accessing private copies on the heap is very expensive.
- Does not affect explicit-shape arrays.
- Removed this flag for the collision kernel and set OMP\_STACKSIZE to a large value.
- Run time improves from 348 sec to 43 sec.
- Alternative: use !\$OMP THREADPRIVATE. Downside: the data has to be static, not allocatable.
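
The pattern behind this finding can be sketched as follows. This is a minimal illustration, not XGC1 code: the routine `kernel` and all sizes are hypothetical. Each thread that calls the routine gets its own copy of the automatic array `tmp`. Where that copy is placed depends on the compiler and its flags; with the Intel compiler and without -heap-arrays it is expected to land on the calling thread's stack, which is why OMP\_STACKSIZE has to be large enough.

```fortran
! Sketch only: 'kernel' and the sizes are hypothetical, not XGC1 code.
module work_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine kernel(n, x, total)
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: x
    real(dp), intent(out) :: total
    real(dp) :: tmp(n)   ! automatic array: a fresh copy for every call
    integer :: i
    do i = 1, n
      tmp(i) = x * real(i, dp)
    end do
    total = sum(tmp)
  end subroutine kernel
end module work_mod

program stack_demo
  use work_mod
  implicit none
  integer, parameter :: n = 10000, ncalls = 8   ! 10000 * 8 bytes = 80 kB per copy
  real(dp) :: results(ncalls)
  integer :: k

  ! Each thread calls kernel independently, so each needs room for its own tmp.
  !$omp parallel do
  do k = 1, ncalls
    call kernel(n, real(k, dp), results(k))
  end do
  !$omp end parallel do

  print *, results(1), results(ncalls)
end program stack_demo
```

If the per-thread arrays are large, a run of this kind would typically need something like `export OMP_STACKSIZE=...` set before launch, as done for the collision kernel above.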





### **XGC1: Explore Nested OpenMP**



- Always use the best available thread affinity. Avoid spreading threads across NUMA domains.
- Currently, the nested configuration:

      export OMP_NUM_THREADS=6,4
      export OMP_PROC_BIND=spread,close
      export OMP_NESTED=TRUE
      export OMP_STACKSIZE=8000000
      aprun -n 200 -N 2 -S 1 -j 2 -cc numa_node ./xgca

- is a bit slower than the flat 24-thread configuration (work ongoing):

      export OMP_NUM_THREADS=24
      export OMP_NESTED=TRUE
      export OMP_STACKSIZE=8000000
      aprun -n 200 -d 24 -N 2 -S 1 -j 2 -cc numa_node ./xgca

- Refer to the NERSC "Nested OpenMP" web page for achieving process and thread affinity using different compilers on different NERSC systems:
  - <u>https://www.nersc.gov/users/computational-systems/edison/running-jobs/using-openmp-with-mpi/nested-openmp/</u>
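
The structure that the 6,4 setting drives can be sketched as two nested parallel regions. This is an illustration, not XGC1 code: the team sizes are hard-coded with `num_threads` rather than taken from OMP\_NUM\_THREADS, and the runtime may grant fewer threads than requested. The program needs to be compiled with OpenMP enabled because it uses `omp_lib`.

```fortran
! Sketch only: team sizes are hard-coded for illustration; the runs
! above set them through OMP_NUM_THREADS=6,4 instead.
program nested_demo
  use omp_lib
  implicit none
  integer :: outer_id
  integer :: counts(0:5)   ! inner-team members counted per outer thread

  counts = 0
  call omp_set_nested(.true.)          ! same effect as OMP_NESTED=TRUE (deprecated in OpenMP 5.0)
  call omp_set_max_active_levels(2)    ! allow two active levels of parallelism

  !$omp parallel num_threads(6) private(outer_id)
  outer_id = omp_get_thread_num()

  ! Every outer thread forks its own inner team.
  !$omp parallel num_threads(4)
  !$omp atomic
  counts(outer_id) = counts(outer_id) + 1
  !$omp end parallel

  !$omp end parallel

  print *, 'inner threads per outer thread:', counts
end program nested_demo
```

Where those threads land on the node is still decided by the affinity settings (OMP\_PROC\_BIND and the launcher options), which the sketch does not control.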







- Plane wave Lagrange multiplier
  - Many matrix multiplications of complex numbers, C = A x B
  - Smaller matrix products: FFM, typical size 100x10,000x100
  - Original thread scaling with MKL was not satisfactory
- OpenMP "Reduce" or "Block" algorithm
  - Distribute work on A and B along the k dimension
  - A thread puts its contribution in a buffer of size m x n
  - Buffers reduced to produce C
  - OMP teams of threads
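
A minimal sketch of the "Reduce" idea follows, under stated assumptions: it uses plain loops and an OpenMP array reduction instead of library GEMM calls, small stand-in sizes instead of 100x10,000x100, and no thread teams. It is not NWChem code. The `reduction(+:c)` clause gives each thread a private m x n copy of `c` (the per-thread buffer) and sums the copies at the end.

```fortran
! Sketch only: sizes and data are made up; not NWChem code.
program reduce_gemm
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: m = 8, n = 8, k = 1000   ! small stand-in sizes
  complex(dp), allocatable :: a(:,:), b(:,:), c(:,:), c_ref(:,:)
  integer :: i, j, l

  allocate(a(m,k), b(k,n), c(m,n), c_ref(m,n))
  do l = 1, k
    do i = 1, m
      a(i,l) = cmplx(real(i,dp), real(l,dp), kind=dp) / real(k,dp)
    end do
    do j = 1, n
      b(l,j) = cmplx(real(l,dp), -real(j,dp), kind=dp) / real(k,dp)
    end do
  end do

  c = (0.0_dp, 0.0_dp)
  ! The k dimension is split among threads. Each thread accumulates into
  ! its own private m x n copy of c; OpenMP sums the copies at the end.
  !$omp parallel do reduction(+:c) private(i, j)
  do l = 1, k
    do j = 1, n
      do i = 1, m
        c(i,j) = c(i,j) + a(i,l) * b(l,j)
      end do
    end do
  end do
  !$omp end parallel do

  c_ref = matmul(a, b)   ! serial reference result
  print *, 'max abs difference vs matmul:', maxval(abs(c - c_ref))
end program reduce_gemm
```

The approach pays off when m x n is small and k is large, since the extra memory and the final reduction scale with m x n while the parallel work scales with k.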







#### **NWChem: OpenMP "Reduce" Algorithm**



- Better for smaller inner dimensions, i.e. for FFMs
- Multiple FFMs can be done concurrently in different thread pools
- Threading enables us to use all 240 hardware threads
- Best Reduce: 10 MPI, 6 teams of 4 threads



Figure: MKL compared with the best "Reduce" configuration (10 MPI, 6 teams of 4 threads).







## **NWChem: OpenMP Scaling in CCSD(T)**



- Double terms usually dominate in the (T) term
- Other terms become the new performance bottleneck on manycore architectures (Amdahl's Law)
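
For reference, Amdahl's Law bounds the speedup $S$ on $N$ threads when only a fraction $p$ of the run time is threaded. The numbers in the worked example are illustrative assumptions, not NWChem measurements.

$$
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
$$

With $p = 0.95$ and $N = 240$ this gives $S \approx 18.5$, and the limit for $N \to \infty$ is $1/(1-p) = 20$. Even a 5% unthreaded remainder therefore caps the gain from 240 hardware threads, which is why terms that were negligible before become the bottleneck.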







## **NWChem: OpenMP Scaling in CCSD(T)**

- Threading enables us to use all 240 hardware threads
- Optimized code performs 2.5X better than baseline
- Up to 65X better compared to 1 MPI rank







# **Vectorization Examples**









## **XGC1: Collision Kernel**

Split dimensions, interchange array indices, and unroll loops: 40% kernel speedup



Figure panels: original and optimized code listings (not reproduced).



#### **BerkeleyGW**


3X faster on SandyBridge, 8X faster on KNC



```
! ngpown is typically in the 100s to 1000s: good for many threads.
!$OMP DO reduction(+:achtemp)
do my_igp = 1, ngpown
 ...
 ! Original inner loop (iw = 1,3): too small to vectorize!
 do iw=1,3
  scht=0D0
  wxt = wx_array(iw)
  ! ncouls is typically in the 1000s to 10,000s: good for vectorization.
  do ig = 1, ncouls
   ! Attempt to save work breaks vectorization and makes code slower:
   !if (abs(wtilde_array(ig,my_igp) * eps(ig,my_igp)) .lt. TOL) cycle
   wdiff = wxt - wtilde_array(ig,my_igp)
   delw = wtilde_array(ig,my_igp) / wdiff
    ...
   scha(ig) = mygpvar1 * aqsntemp(ig) * delw * eps(ig,my_igp)
   scht = scht + scha(ig)
  enddo ! loop over g
  sch_array(iw) = sch_array(iw) + 0.5D0*scht
 enddo
 achtemp(:) = achtemp(:) + sch_array(:) * vcoul(my_igp)
enddo
```

#### **CESM MG2 Kernel: OMP SIMD ALIGNED**



- !\$OMP SIMD ALIGNED (...)
  - OpenMP standard, portable
  - Tells the compiler that the arrays in the list are aligned
  - Asserts that there are no dependencies
  - Requires PRIVATE or REDUCTION clauses to ensure correctness
  - Forces the compiler to vectorize, whether or not it thinks doing so helps performance
- !DIR\$ ASSUME\_ALIGNED (...)
  - Tells the compiler that the arrays in the list are aligned
  - Intel specific, not portable
- !DIR\$ VECTOR ALIGNED
  - Tells the compiler that all arrays in a loop are aligned
  - Intel specific, not portable
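
The PRIVATE and REDUCTION requirement can be illustrated with a small sketch. It is not MG2 code, and the names `t` and `total` are hypothetical. The ALIGNED clause is deliberately left out because the sketch does not control how the arrays are aligned; adding `aligned(a, b : 64)` would only be valid if both arrays really start on 64-byte boundaries, for example when built with an alignment flag.

```fortran
! Sketch only: not MG2 code. ALIGNED is omitted because alignment is not guaranteed here.
program simd_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024
  real(dp), allocatable :: a(:), b(:)
  real(dp) :: t, total
  integer :: i

  allocate(a(n), b(n))
  do i = 1, n
    a(i) = real(i, dp)
    b(i) = 2.0_dp
  end do

  total = 0.0_dp
  ! t is a per-iteration scratch value and total is accumulated across
  ! iterations, so they need PRIVATE and REDUCTION respectively.
  !$omp simd private(t) reduction(+:total)
  do i = 1, n
    t = a(i) * b(i)
    total = total + t
  end do
  !$omp end simd

  print *, 'total =', total
end program simd_demo
```

Because the directive asserts independence, leaving out either clause would let the compiler vectorize a loop that is then no longer correct.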





#### **CESM MG2 Kernel: OMP SIMD ALIGNED**



- Using the "ALIGNED" clause achieved an 8% performance gain when the list is explicitly provided.
- However, the process is tedious and error-prone, and often impossible in large real applications.
  - !\$OMP SIMD ALIGNED was added to 48 loops in the MG2 kernel, many with lists of 10+ variables
- Requests raised for the Fortran standard:
  - An equivalent of "!DIR\$ ATTRIBUTES ALIGN: 64 :: A"
    - C/C++: float A[1000] \_\_attribute\_\_((aligned(64))); (a GNU extension; C11 and C++11 standardize \_Alignas and alignas)
    - Not in the Fortran standard yet
  - An equivalent of the "-align array64byte" compiler flag
    - Exists in the Intel (Fortran only) and Cray compilers
    - What about other compilers?





# **Using HBM Examples**











- Identify candidate key arrays for HBM
  - The VTune Memory Access tool can help find key arrays
  - Use NUMA affinity to simulate HBM on a dual-socket system
  - Use FASTMEM directives and link with the jemalloc/memkind libraries

On Edison (NERSC Cray XC30):

    real, allocatable :: a(:,:), b(:,:), c(:)
    !DIR$ ATTRIBUTES FASTMEM :: a, b, c

    % module load memkind jemalloc
    % ftn -dynamic -g -O3 -openmp mycode.f90
    % export MEMKIND_HBW_NODES=0
    % aprun -n 1 -cc numa_node numactl --membind=1 --cpunodebind=0 ./myexecutable

On Haswell:

    % numactl --membind=1 --cpunodebind=0 ./myexecutable

| Application | All memory on far memory | All memory on near memory | Key arrays on near memory |
|-------------|--------------------------|---------------------------|---------------------------|
| BerkeleyGW  | baseline                 | 52% faster                | 52.4% faster              |
| EmGeo       | baseline                 | 40% faster                | 32% faster                |
| XGC1        | baseline                 |                           | 24% faster                |



#### **Conclusions**



- NERSC is providing many resources to help users: training, postdocs, Cray and Intel staff, and deep-dive sessions.
- Optimizing code for Cori will likely require good OpenMP scaling, vectorization, and/or effective use of HBM.
- Applications can be optimized on SandyBridge, IvyBridge, Haswell, and KNC architectures to prepare for Cori.
- Always profile and understand your code first to decide where to work on improving performance. Use tools such as VTune and Vector Advisor.
- Creating kernels is much more efficient than working on full codes.
- Optimizing your code for KNL will improve performance on all architectures.
- Keep portability in mind; use portable programming models.







#### Thank you.



