NERSCPowering Scientific Discovery Since 1974

VASP Case Study

Code description and computational problem

The Vienna Ab-initio Simulation Package (VASP) [1-2] is a widely used materials science application for performing ab-initio electronic structure calculations and quantum-mechanical molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane wave basis set. VASP computes an approximate solution to the many-body Schrödinger equation, either within the Density Functional Theory (DFT) to solve the Kohn-Sham equation or the Hartree-Fock (HF) approximation to solve the Roothaan equation. Hybrid functionals that mix the HF approach with DFT are implemented, and Green's functions methods(GW quasi-particles and ACFDT-RPA) and many-body perturbation theory (2nd-order Møller-Plesset) are available in VASP as well.

The fundamental mathematical problem that VASP solves is a non-linear eigenvalue problem that has to be solved iteratively via self-consistent iteration cycles until a desired accuracy is achieved. This application makes use of efficient iterative matrix diagonalization techniques like the residual minimization method with direct inversion of the iterative subspace (RMM-DIIS) and the blocked Davidson algorithms. FFTs and Linear Algebra libraries (BLAS/LAPACK/ScaLAPACK) are heavily depended on. Currently, the released production code (5.3.5) is a pure MPI code, and the developers have been working on adding the OpenMP directives to prepare for the next generation multi-core/many-core HPC systems.

VASP is one of the Tier-3 NESAP [3] codes. In October 2015, the code team participated in a multi-day intense application optimization session called a "dungeon session" at Intel Offices in Hillsboro, and this case study is a report for the VASP dungeon session. 

The work described here was performed by the VASP developer, Martijn Marsman, at the University of Vienna, Zhengji Zhao and Jack Deslippe at NERSC, and Jeongnim Kim, Martyn Corden, Sumedh Naik, Christopher Cantalupo, Gregory Junker, Ruchira Sasanka and other engineers at Intel.

Starting point: Preparing for the dungeon

In order to attend a dungeon session, NESAP teams had to fill out the Dungeon Application Worksheet provided by NERSC, which was designed to guide the teams through choosing a science problem, profiling their application, generating kernels and identifying bottlenecks and an optimization strategy for their applications. Since the most of the work in VASP is done by libraries such as FFTW, BLAS, and ScaLAPACK (or ELPA), it was difficult for the team to extract self contained kernel codes from the hotspots in the code, and they decided to work with the full up-to-date (9/29/2015) development version of VASP, which was an MPI/OpenMP hybrid code. Due to the fact that the primary optimization focus of the VASP dungeon session was on-node performance, the developers chose the test system shown in Fig. 1, which fits on a single Xeon node and represents a typical code path for VASP. 

PdO2

Figure 1. The benchmark test case used in profiling the MPI/OpenMP hybrid VASP code (the development version up-to 9/29/2015). The test system contains 150 Palladium (Pd) atoms and 24 Oxygen atoms (O) (denoted as PdO@Pd-slab hereafter), 1644 electrons in 1024 bands, and 33967 planewaves per band. The RMM-DIIS iteration scheme was tested, and it was executed over multiple bands simultaneously. The input deck can be found here.  

The profiling of the hybrid code showed a reasonable scaling up to 8 threads (Fig. 2), where the performance of the hybrid code matched the pure MPI code. The best performance was achieved with 4 threads per MPI task (15% speedup in comparison to the pure MPI code). Beyond 8 threads and up, the poor thread performance of the FFTs (threaded 3d cfftw from MKL), BLAS1 calls (in eddrmm), and an increase in the MPI_alltoall costs with decreasing number of MPI ranks caused the code to stop scaling.

VASP thread scaling

Figure 2. The thread scaling of the MPI/OpenMP VASP code with the test case PdO@Pd-slab.  These are the fixed core (16 cores) tests running on one of the sockets on an Intel dual-socket Haswell node (Xeon E5-2697 v3 2.6 GHz) at the University of Vienna. The horizontal axis shows the number of MPI tasks and the OpenMP threads per tasks, and the vertical axis shows the total run time of the code (blue) and the run time breakdown for the top subroutines in the code.

The run time comparison between running on fully packed and half packed nodes (Fig. 3) showed that the most of the subroutines such as FFTs (fftwav_mpi and fftext_mpi), the routines that map to the ZGEMM (lincom and orth1), and the DGEMM (rpromu and racc0mu) were likely bandwidth bound, especially the real-space projection routines rpromu and racc0mu (the bars in the blue box). They had the most run time reduction (30-35%) when the memory bandwidth available per task was doubled by running on half-packed nodes. The FFTs work on contiguous data structures. The ZGEMM calls work on contiguous data structures and is blocked to attempt to reach peak performance. Unfortunately, rpromu and racc0mu access their relevant input and output data through an index table (gather/scatter), and could be a contributing factor for the higher bandwidth demand on these two routines. 

VASP BW

Figure 3. The VASP run time breakdown over the top subroutines when running on the fully packed (blue) and half-packed (red) nodes. The tests were done on an Intel Xeon E5-2697 v3 node and used the test case PdO@Pd-slab. When running on half packed nodes, the memory bandwidth available for each task is twice as much as it is when running on fully packed nodes. This experiment was designed to test if an application code is bandwidth bound. 

The run time comparison between running on two different clock speeds (Fig. 4) showed that the most of the subroutines were likely CPU bound as well. However, the two real-space projection routines, rpromu and racc0mu, hardly benefited from the increased clock-speed (the bars in the blue box), indicating that they are firmly memory bandwidth bound (also Fig. 3). This was confirmed by further analysis with the VTune memory bandwidth tests and suggests that the two routines would likely get the most performance benefit from allocation to the high bandwidth memory (HBM or MCDRAM) that will be available on the Cori Phase II KNL nodes.

 

Figure 4. The VASP run time breakdown over the top subroutines when running at the clock speeds of 2.5 GHz, and 1.9 GHz on an Intel Xeon E5-2697 v3 node. The test case used was PdO@Pd-slab, and this experiment was designed to test if the code is CPU bound.

Alongside the preparation work done by the developers, the liaison for the VASP code team at NERSC worked on estimating the performance benefit of HBM with a production VASP code (version 5.3.5). This was done using the Intel development tools Memkind and AutoHBW [4-7] as well as the dual-socket Ivy Bridge nodes on Edison. The Memkind and AutoHBW libraries can be used to allocate heap arrays to HBM. The near and far socket memory accessed via QPI links on the dual-socket node (Ivy Bridge) on Edison can serve as proxies to the HBM and DDR memories on a KNL node. Although emulating HBM on the dual-socket Ivy Bridge node is not an accurate model of the bandwidth and latency characteristics of the KNL on package memory (HBM or MCDRAM), it is a reasonable way to determine which data structures rely critically on bandwidths. The far socket memory accessed via the QPI links has a 33% reduced bandwidth than that of the near socket.

The Memkind library is a user extensible heap manager built on top of Jemalloc which enables control of memory characteristics and partitioning of the heap between different kinds of memory (including user defined kinds of memory).  Using the Intel compiler directive !DIR$ ATTRIBUTES FASTME for Fortran codes and/or the malloc wrapper API hbw_malloc for C/C++ codes, one can allocate the selected arrays to the HBM by linking the application codes to the Memkind library. Another way to do this is through the AutoHBW, a library used to automatically allocate arrays to the HBM without any modifications to the application source codes. This library intercepts the standard heap allocations (e.g., malloc, calloc) in the applications so that it can allocate them in the HBM. To use this tool, an environment variable such as AUTO_HBW_SIZE must be used to specify the size range for the arrays allocated to the HBM.

The estimated HBM performance impact to the VASP code is shown in Fig. 5. With only a 33% difference in bandwidths, the total run time of the VASP code was reduced by about 40% when all the arrays sized from 1M to 5M were allocated to the HBM. Given the fact that the HBM on a KNL node has a bandwidth five times the size of the DDR memory's, this experiment was very encouraging to the developers and motivated them to look into HBM optimization during the dungeon session. To efficiently make use of the limited amount of HBM on a KNL node, it is critical to only allocate the arrays that generate the most memory traffic.

 

 

 

 

 

 

Figure 5. The simulated HBM performance impact to the VASP hybrid functional (HSE06) calculations on Edison. The production code VASP 5.3.5 (pure MPI code) was used. The tests were run with 4 Edison nodes, and had 48 tasks in total. The horizontal axis shows the size range of the arrays that were allocated to the HBM using the AutoHBW library tool, and the vertical axis shows the total run time of the VASP code. The leftmost bar (All DDR) shows the total runtime when everything is allocated to DDR memory (All DDR), and the rightmost bar (All HBM) shows the run time when everything is allocated to the HBM. The benchmark test case was a system with 105 Boron atoms shown on the right, and the input deck can be found here.

In addition, the developers learned during the DFT day organized by NERSC at the end of the IXPUG 2015 conference
that Libxsmm [8], an optimized library for small matrix matrix multiplication and outperformer of MKL when the matrix size is sufficiently small (e.g., n<40), could be an optimal library choice for VASP where many small matrixes and matrix multiplications are repeatedly invoked. 

Code optimizations during the dungeon session

Based on the pre-dungeon work and discussions between developers, NERSC staff, and Intel Engineers, the following optimizations were attempted during the dungeon session:

  • Analyzed the HBM performance impact by allocating a few selected arrays to HBM using the Memkind library and the Intel compiler directive !DIR$ ATTRIBUTES FASTMEM. 
  • Attempted to optimize the real space projection routines and developed a four-step optimization strategy. Two of the steps were implemented during the dungeon session and the rest remained to be used after.
  • Attempted to use Libxsmm. However, developers found that it cannot be used until JIT is fully supported.

The team also analyzed the bandwidth with VTune and attempted to understand the performance data obtained from pre-dungeon profiling.

Allocating arrays in HBM using the Memkind library and the FASTMEM Intel compiler directive 

On a KNL node (72 cores and 288 hardware threads) on Cori, there will be a 16 GB HBM and a 96 GB DDR memory. To efficiently use the relatively small amount of available HBM, it is important to only allocate the arrays that generate the most memory traffic to the HBM. Based on pre-dungeon analysis, the developers decided to allocate the three largest arrays in the real-space projection routine RACC0MU to the HBM. Since VASP is written in Fortran 90, one must add the Intel compiler directive FASTMEM in front of the arrays to use the Memkind library to allocate them to the HBM. Note that the Memkind library is not capable of allocating arrays that are allocated in the stack memory such as the automatic arrays in Fortran and the OpenMP private arrays. Adding the FASTMEM directive to the code is straightforward if the arrays to be placed in the HBM are allocatable. For example, in the code example below,

  SUBROUTINE RACC0MU(NONLR_S, WDES1, CPROJ_LOC, CRACC, LD, NSIM, LDO)
...
    REAL(qn),ALLOCATABLE:: WORK(:),TMP(:,:)
    GDEF,ALLOCATABLE    :: CPROJ(:,:)
...
    ALLOCATE(WORK(ndata*NSIM*NONLR_S%IRMAX),TMP(NLM,ndata*2*NSIM),CPROJ(WDES1%NPRO_TOT,NSIM))
...
  END SUBROUTINE RACC0MU

the only change needed to allocate the arrays WORK, TMP, and CPROJ to the HBM was to add the following line to the code right below the GDEF,ALLOCATABLE :: CPROJ(:,:)). 

    !DIR$ ATTRIBUTES FASTMEM :: WORK,TMP,CPROJ

However, VASP uses array pointers. To allocate the arrays associated with array pointers, some extra work arrays were needed. For example, in the following code example,

INTEGER, POINTER :: NLI (:,:) ! index for gridpoints
REAL(qn),POINTER, CONTIGUOUS :: RPROJ (:) ! projectors on real space grid
COMPLEX(q),POINTER, CONTIGUOUS::CRREXP(:,:,:) ! phase factor exp (i k (R(ion)-r(grid)))

...

ALLOCATE(NONLR_S%NLI(IRMAX,NIONS),NONLR_S%RPROJ(NONLR_S%IRALLOC))
ALLOCATE(NONLR_S%CRREXP(IRMAX,NIONS,1))

the developers had to use the allocatable arrays fm_NLI, fm_RPROJ and fm_CRREXP to allocate the arrays pointed by the three array pointers NLI, RPROJ, and CRREXP to the HBM. In addition, the !DIR$ ATTRIBUTES FASTMEM directive was added in front of them as shown in the code below.

INTEGER, POINTER :: NLI (:,:) ! index for gridpoints
REAL(qn),POINTER, CONTIGUOUS :: RPROJ (:) ! projectors on real space grid
COMPLEX(q),POINTER, CONTIGUOUS::CRREXP(:,:,:) ! phase factor exp (i k (R(ion)-r(grid)))

! To exploit fastmem
INTEGER, ALLOCATABLE :: fm_NLI (:,:) ! index for gridpoints
REAL(qn), ALLOCATABLE :: fm_RPROJ (:) ! projectors on real space grid
COMPLEX(q), ALLOCATABLE :: fm_CRREXP(:,:,:) ! phase factor exp (i k (R(ion)-r(grid)))

!DIR$ ATTRIBUTES FASTMEM :: fm_RPROJ,fm_CRREXP,fm_NL

...

! put NLI and RPROJ into fastmem
ALLOCATE(NONLR_S%fm_NLI(IRMAX,NIONS),NONLR_S%fm_RPROJ(NONLR_S%IRALLOC))
NONLR_S%NLI =>NONLR_S%fm_NLI
NONLR_S%RPROJ=>NONLR_S%fm_RPROJ

ALLOCATE(NONLR_S%fm_CRREXP(IRMAX,NIONS,1))
NONLR_S%CRREXP=>NONLR_S%fm_CRREXP

After the FASTMEM directive was added to the code, it was compiled and run on Edison. Fig. 6 (upper panel) shows the comparison between the run time when everything was allocated in the DDR memory (blue:All DDR), when only a few selected arrays were allocated to HBM (red:Mixed), and when all arrays were allocated to HBM (green:All HBM). About 8% of speedup was achieved when a few selected arrays were allocated to HBM in comparison to when everything was allocated in the DDR memory. Further looks into the run time breakdown over the top subroutines revealed that allocating a few arrays in HBM (red:Mixed) showed a promising speedup (up to 30%) with the BLAS1-like subroutines (middle panel), while the real-space projection subroutine RACC0MU (lower panel) had a modest speedup (7%) in contrast. A further speedup of RACC0MU was expected of the developers (up to the level of the results obtained with everything in HBM) if a more careful selection of the arrays to be allocated to the HBM was attempted.

VASP BLAS1 CAT3

 

Figure 6. VASP performance comparison between runs when everything was allocated in the DDR memory (blue/slow), when only a few selected arrays were allocated to HBM (red/mixed), and when everything was allocated to HBM (green/fast). The test case PdO@Pd-slab was used, and the tests were run on a single Edison node.

Developing a four-step optimization plan for the real space projection routines

Another optimization attempted was on the real-space projection routines, which have the three major loops shown in the pseudo code in Fig. 7 (left panel):

 

Figure 7. A real-space projection subroutine code structure (left). The right panel illustrates the three loop regions over the ions/projectors, the number of bands in a group, and the number of grid points in each group.

This piece of the code implements the dot product between the values on the spheres to a wavefunction that lives in the entire grid and is one of the hotspots in the code as shown in Fig. 3 and Fig. 4. A four-step optimization strategy was developed during the dungeon session (Fig. 8). First, make sure the wavefunction ψand a work array W are placed in the HBM since this routine is memory bandwidth bound. Second, reorder the two loops over the ions and the bands in the block so that the wavefunction ψn, which does not depend on ions (the index k), can be reused. Next, to further improve the data locality, sort over the ions and ψn  so that a given region of a wavefunction as well as a whole given wavefunction can be reused. Lastly, vectorize the inner loops by taking advantage of the stride 1 block buried in the index array in the sphere. The best way of doing this is to pad the sphere into a cube.

Optimization description
optimization steps

 

Figure 8. The four-step optimization plan for real-space projection routines. 

The optimization implemented in the dungeon session focused on the inner loop vectorization that the compiler failed to do. Instead of padding the sphere into a cube, which requires significant rewrite of the code, we tried to detect the stride 1 block in the index array, and vectorized loops over those indices only. The remainder was unchanged. For example, for the following loop,

do i = 1, n
A(Index(i))=B(i)

we rewrote it as

while (i < n):
ip = Index(i)
if (Index(i+veclength) == ip + veclength:
do j = 1, veclength
A(ip+j) = B(i+j)
i += veclength
else ...

Using the opt-report from Intel compiler and VTune, we were able to confirm the vectorization, although alignment was an issue. However, we did not see any speedup with this vectorization alone because the routine was bandwidth bound. We did not allocate any arrays in HBM when this vectorization was attempted.

Post dungeon follow-up

  • After the dungeon session, the developers added the Intel FASTMEM directive to the code with careful selection of the arrays placed to the HBM. The resulting code is KNL ready as far as the HBM is concerned.
  • Since the dungeon session, an optimization effort led by Intel engineers in collaboration with the developers, NERSC and ZIB, has been ongoing. Major progress towards this effort was the profiling of the code on early KNL nodes. 
  • After the dungeon session, an Intel compiler expert who worked in the code team filed a feature request for the Intel compiler to make OpenMP private copies allocate in HBM if the parent copy in the master thread is in HBM (has the FASTMEM attribute).

Lessons learned

  • HBM may have a significant performance benefit to applications. Selectively allocating arrays to HBM is a key optimization tactic to use the small amount of available HBM efficiently on KNL nodes. The VTune profiling, especially the memory-access analysis, is useful to identify the arrays that generate the most memory traffic in the code. These arrays are good candidates for allocation to the HBM.
  • Intel development tools like Memkind and AutoHBW are very helpful for users and developers to carry out HBM optimization for KNL on today's architectures. Early adoption of these tools is key to produce the KNL ready codes.
  • The available tools such as Memkind and AutoHBW are only capable of allocating heap arrays in HBM, and there is no good way of allocating stack arrays so far. This prevents OpenMP private arrays from using HBM. It is also necessary to change stack arrays (where applicable) to allocatable arrays to use HBM. Although we observed some performance slowdown associated with this change, the benefit from HBM should overwrite this.
  • The Intel compiler directive !DIR$ ATTRIBUTES FASTME does not work with array pointers directly. Allocatable work arrays must be introduced and allocated to the HBM first before association to the array pointers.
  • The FASTMEM directive is an Intel compiler specific directive, and is not supported by other compilers. This may cause some problems with the code optimization portability. Cray compilers may support this directive (with slight modification) in the future.
  • Restructuring codes by changing loop orders among the nested loops may improve the data locality.
  • VTune and compiler optimization (-qopt-report) output are very helpful for checking if a specific loop is vectorized or not. Vectorization is possible by taking advantage of the stride 1 block buried in the index array by padding.  

References

  1. VASP:  http://www.vasp.at/
  2. VASP:  G. Kresse and J. Furthm_ller. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mat. Sci., 6:15, 1996
  3. NESAP:  http://www.nersc.gov/users/computational-systems/cori/nesap/
  4. Memkind and Auto HBW tool:  http://memkind.github.io/memkind
  5. Memkind:  http://memkind.github.io/memkind/memkind_arch_20150318.pdf
  6. Memkind http://ihpcc2014.com/pdf/100_KNL_HPC_Developer_Forum_SC_14.pdf
  7. Libxsmm:  https://github.com/hfp/libxsmm