
Using OpenMP Effectively

Performance Implications of Using OpenMP

A combined team of NERSC and Cray staff has formed a Cray "Center of Excellence" to examine programming models beyond pure MPI on the 24-core Hopper nodes.  One of the key findings with hybrid MPI-OpenMP codes is that although OpenMP may not alter the performance of an application very much, it can dramatically decrease memory usage, allowing larger problems to be addressed.  See the figures and explanations below.

When using hybrid OpenMP-MPI in applications, NERSC strongly suggests running with 4 MPI tasks per node, each with 6 OpenMP threads.  Using 1 MPI task and 24 OpenMP threads per node is NOT an optimal way to use the Hopper nodes, because it is very hard to manage the performance implications of the Non-Uniform Memory Access (NUMA) architecture of the Hopper nodes when using more than 6 OpenMP threads.

Here we discuss some of the results of the Center of Excellence in more detail, with examples based upon analysis of NERSC users' codes.

Basic OpenMP Performance Considerations

In order to use the hybrid OpenMP-MPI programming model optimally on Hopper, it is essential to understand the basic architecture of a Hopper node.  Each node contains two sockets, each with a 12-core AMD Magny-Cours processor. One Magny-Cours is in fact a multi-chip module containing two six-core dies, so each node is essentially a four-chip node. A core accessing memory attached to a different die incurs a performance penalty. More details are available in Basics of Hopper architecture, Running OpenMP jobs on Hopper, and the Hopper Multi-Core FAQ.

The performance implications of ignoring this layout are shown here: the performance of the Stream benchmark (which measures memory bandwidth) using 1-24 cores of a Hopper node. If all of the memory is allocated on one die (“TouchByOne”) as opposed to spread across all four (“TouchByAll”), the available memory bandwidth decreases by a factor of four when using all 24 cores.

To place the memory correctly, each page must first be touched on the same die that will use it in the main part of the code. For the Stream benchmark this is straightforward:

double a[N], b[N], c[N];
double d = 3.0;      /* triad scalar */
.....
/* First-touch initialization: each thread touches the pages it will use later */
#pragma omp parallel for
for (int j = 0; j < N; j++) {
    a[j] = 1.0; b[j] = 2.0; c[j] = 0.0;
}

/* Triad loop: the same loop schedule keeps each element local to the die that first touched it */
#pragma omp parallel for
for (int j = 0; j < N; j++) {
    a[j] = b[j] + d*c[j];
}

The first loop ‘first-touches’ the a, b, and c arrays with the same thread decomposition as the working loop, so each element ends up on the die that will later use it. This example is very simple; the advice above to use at most six OpenMP threads per MPI task on Hopper arises because, in more complicated codes, it may not be possible to first-touch data in a way that ensures the correct locality for every routine, since the same data structures may be accessed in different ways in different parts of the code. The sketch below illustrates the problem.
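
As a minimal, hypothetical sketch of this situation (the array and routine names are illustrative and not taken from any of the codes discussed below): if one routine sweeps an array by rows and another sweeps the same array by columns, no single first-touch layout keeps the data local to both loops, and whichever routine does not match the initialization pattern pays the NUMA penalty once the threads span more than one die.

#define NROWS 2048
#define NCOLS 2048

static double grid[NROWS][NCOLS];

/* First touch by rows: each thread initializes whole rows, so those rows
   land in the memory of the die running that thread. */
void init_by_rows(void)
{
    #pragma omp parallel for
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            grid[i][j] = 0.0;
}

/* This routine also sweeps by rows with the same schedule, so its accesses
   stay local to the die that first touched the data. */
void update_by_rows(double s)
{
    #pragma omp parallel for
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            grid[i][j] += s;
}

/* This routine parallelizes over columns instead.  Each thread now walks
   through rows that were first-touched by other threads, so many of its
   accesses are remote no matter how the array was initialized. */
void update_by_cols(double s)
{
    #pragma omp parallel for
    for (int j = 0; j < NCOLS; j++)
        for (int i = 0; i < NROWS; i++)
            grid[i][j] += s;
}

With six or fewer threads per MPI task the whole team can sit on a single die, so both access patterns remain local; with more threads the mismatch between the first-touch pattern and the column-wise sweep starts to cost memory bandwidth.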

OpenMP Benefits and Drawbacks

As mentioned above, the principal benefit of using OpenMP is that it can decrease memory requirements while, in most cases, delivering almost equal performance. There are also several other benefits, including:

  • Potential additional parallelization opportunities on top of those exploited by MPI.
  • Less domain decomposition, which can help with load balancing as well as allowing for larger messages and fewer tasks participating in MPI collective operations.
  • OpenMP is a standard, so any modifications introduced into an application are portable and appear as comments on systems not using OpenMP.
  • By adding annotations to existing code and using a compiler option, it is possible to add OpenMP to a code incrementally, almost on a loop-by-loop basis.  If you have a code that vectorizes well, the vector loops are good candidates for OpenMP; see the sketch after this list.
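
As a hypothetical illustration of this incremental approach (the loop below is not taken from any particular application), a loop with independent iterations can be annotated with a single directive and left otherwise unchanged; when the code is compiled without the OpenMP option, the pragma is simply ignored.

/* A typical vectorizable loop: the iterations are independent, so one
   OpenMP directive parallelizes it without any other source changes.
   Built without the compiler's OpenMP flag, the pragma is ignored and
   the loop runs serially as before. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}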

There are also some potential drawbacks:

  • OpenMP can be hard to program and/or debug in some cases.
  • When using more than six threads on Hopper it is very important to pay careful attention to NUMA and locality control.
  • If an application is network-bound or memory-bandwidth-bound then threading is not going to help; in that case it is acceptable to leave some cores idle.
  • In some cases a serial portion may be essential, and that can inhibit performance.
  • In most MPI codes synchronization is implicit and happens when messages are sent and received, but with OpenMP much of the synchronization must be added to the code explicitly.  The programmer must also explicitly determine which variables can be shared among threads and which cannot (the parallel scoping). OpenMP codes with errors introduced by incomplete or misplaced synchronization, or by improper scoping, can be difficult to debug because such errors create "race" conditions that appear only intermittently; see the sketch after this list.
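
As a hypothetical illustration of the scoping pitfall (this code is not taken from any of the applications discussed here), the first version below races because a temporary declared outside the loop is shared by default; scoping it to the loop body makes it private to each thread and removes the race.

/* Buggy version: 'tmp' is declared outside the parallel loop, so it is
   shared by default.  Threads overwrite each other's value and the result
   varies from run to run -- an intermittent race condition. */
double sum_of_squares_buggy(int n, const double *x)
{
    double tmp, total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        tmp = x[i] * x[i];      /* race: every thread writes the same tmp */
        total += tmp;
    }
    return total;
}

/* Fixed version: the temporary is declared inside the loop body, making it
   private to each thread, while the reduction clause combines 'total' safely. */
double sum_of_squares(int n, const double *x)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        double tmp = x[i] * x[i];
        total += tmp;
    }
    return total;
}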

In this section we present some case studies based upon NERSC users' codes to illustrate some of these points.

Case Study: fvCAM

The Community Atmosphere Model (CAM) is the atmospheric component of the Community Climate System Model (CCSM) developed at NCAR and elsewhere for the weather and climate research communities. In this work we used CAM v3.1 with a finite volume (FV) dynamical core and a "D" grid (about 0.5 degree resolution), running 240 MPI tasks for 3 days of simulated time (240 timesteps). CAM effectively contains two different two-dimensional domain decompositions: one for the dynamics, decomposed over latitude and vertical level, and one for the remapping, decomposed over longitude and latitude. Optimized transposes move data between the program structures for these two decompositions.  The fvCAM source code already contained OpenMP directives and no further modifications were made for this work.

The figure shows the performance of this fvCAM benchmark on Hopper as a function of MPI tasks and OpenMP threads; in other words, the performance of the code using a fixed number of nodes and differing mixes of MPI tasks and OpenMP threads.

As one decreases the number of MPI tasks and increases the number of OpenMP threads, the overall memory usage decreases quite significantly. For example, using three threads and 80 MPI tasks, the performance decreases by only 6% compared to using single-threaded MPI tasks, whereas the memory requirement is reduced by almost 50%. Using six threads incurs quite a substantial performance penalty, principally because of a growth in the time spent in the OpenMP-parallelized parts of the code. For twelve threads, NUMA effects are also present, which add to the performance decrease.

To understand the performance characteristics in more detail, we measured the timings for each of the principal kernels of the code, namely the physics and the dynamics.

This figure shows the performance scaling for the physics component. The performance for each combination of MPI tasks and OpenMP threads is almost identical, indicating that for this part of the code the MPI and OpenMP parallelism is exactly equivalent. The physics part of the code acts on individual grid points separately, so each OpenMP thread can work on a different grid point with no communication between them. In this case even the twelve-thread case shows no performance degradation, because this part of the code is able to "first-touch" each array correctly and avoid NUMA effects.

This figure shows the performance scaling for the dynamics component. In strong contrast to the physics case, the performance is not at all constant as the number of OpenMP threads increases; it decreases. This is mostly because part of the dynamics portion of the code works on z-x slices of the domain, whereas the physics portion works on x-y slices. The coordinate-system transformation between these two representations shows up in the timing of the dynamics piece of the code, and it is not well load-balanced amongst the OpenMP threads: the matrix to be transposed is divided evenly along its dimensions between threads, which produces a different amount of work on each thread and leads to a load imbalance.

To summarize, in these experiments fvCAM shows reasonable performance scaling with respect to the number of OpenMP threads, often with much reduced memory usage. The scaling is affected by the non-optimal threading implementation of the matrix transpose that is part of the dynamics calculation.

Case Study: GTC – 3D Gyrokinetic Toroidal Code

GTC is a fully self-consistent, gyrokinetic 3-D particle-in-cell (PIC) code with a non-spectral Poisson solver. It uses a grid that follows the field lines as they twist around a toroidal geometry representing a magnetically confined toroidal fusion plasma. The version of GTC used here uses a fixed, 1-D domain decomposition with 64 toroidal domains and 24 particle domains, for a total of 1536 MPI tasks. The benchmark runs are for 248 timesteps using 3,200 particles per grid cell.  The GTC source code already contained OpenMP directives and no modifications were made for this work.

The figure shows the performance and memory usage of GTC for this problem running on a total of 1,536 cores on Hopper while varying the number of MPI tasks and OpenMP threads simultaneously. As in the case of fvCAM, substantial reductions in memory usage occur with an increasing number of OpenMP threads; with three threads the memory usage is reduced by 50% and the performance is improved by 15%. Again, as in the fvCAM case, the performance for 12 threads is substantially reduced, by approximately 60% compared to the best-performing case, because of NUMA effects. Since the performance does not change significantly as the number of OpenMP threads and MPI tasks are varied, it seems reasonable to conclude that the code has portions that are equivalently parallelized with respect to OpenMP and MPI. To assess whether this is indeed the case we examined the performance of the three most time-consuming kernels in the code.

From this figure it is apparent that although the overall performance of the code increases slightly as the number of OpenMP threads is increased, this is due to a cancellation effect: the increase in the performance of the "Poisson" kernel is offset by the decrease in the performance of the "pusher" and "shifter" kernels. This is principally because the "Poisson" routine is parallelized with OpenMP but not with MPI.

To summarize, GTC shows good scaling with respect to the number of OpenMP threads. This is due to cancellation between the speedups obtained in some routines and the slowdowns in others, and it illustrates the need to consider the performance of the whole application when examining the benefits of a hybrid MPI-OpenMP programming model.

Case Study: PARATEC – PARAllel Total Energy Code

PARATEC (PARAllel Total Energy Code) performs ab initio density functional theory quantum-mechanical total energy calculations using pseudopotentials, a plane wave basis set and an all-band (unconstrained) conjugate gradient (CG) approach. Part of the calculation is carried out in Fourier space; custom parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space. The benchmark used here is a 686-atom silicon system in a diamond lattice configuration with 20 conjugate gradient iterations, run on 192 and 384 total cores.

In contrast to the other two codes evaluated here, PARATEC does not contain OpenMP directives in its source code. However, a profile of the code reveals that it spends a significant amount of time in a dense matrix-matrix multiplication routine (ZGEMM), which is part of the BLAS library; fortunately, a threaded version of this routine is available in the Cray scientific library (-lsci_mc12_mp).
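
As a rough sketch of what this library-level threading looks like from the application side (the prototype, sizes, and function name below are illustrative and not PARATEC's actual call), the call site is an ordinary BLAS ZGEMM call; the threading comes entirely from linking against the multi-threaded library and setting OMP_NUM_THREADS at run time.

#include <complex.h>

/* Standard Fortran BLAS interface for ZGEMM.  Whether this call runs on one
   core or many is decided at link time (serial vs. multi-threaded library)
   and at run time via OMP_NUM_THREADS; the calling source code is unchanged. */
extern void zgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double complex *alpha,
                   const double complex *a, const int *lda,
                   const double complex *b, const int *ldb,
                   const double complex *beta,
                   double complex *c, const int *ldc);

/* C = A * B for column-major complex matrices: A is m x k, B is k x n. */
void multiply(int m, int n, int k,
              const double complex *a, const double complex *b,
              double complex *c)
{
    const double complex alpha = 1.0, beta = 0.0;
    zgemm_("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
}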

The figure shows the performance and memory usage breakdown for PARATEC for two total core counts, 192 and 384. As PARATEC does not contain OpenMP source code directives, the instrumentation technique used in the previous two cases does not work here, so instead we report the time spent in the two principal kernels of the code: the "ZGEMM" portion, which contains the call to the threaded ZGEMM routine, and the FFT portion, which has no OpenMP parallelization. For the lower core count, 192, the performance is best with one OpenMP thread, because of the poor scalability of the FFT part of the code; the ZGEMM part scales almost perfectly. In contrast, at 384 cores the performance decreases by only 5% when using three threads. In this case, although the time in the compute portion of the FFT increases, this is canceled out by the decrease in time spent communicating because of the smaller number of MPI tasks involved. For both overall core counts the NUMA effects at 12 and 24 threads are very large, mostly, it seems, because the ZGEMM library routine is not NUMA-aware. As in the other cases, the memory savings are quite substantial, and again using three threads typically reduces the memory usage by a factor of two.

To summarize, by using a library-based OpenMP modification to the way PARATEC is run, we were able to obtain substantial memory savings with hardly any reduction in performance.

Case Study: PMAMR

Porous Media Adaptive Mesh Refinement (PMAMR) is an application being used to model carbon sequestration and contaminant transport as part of the Advanced Simulation Capability for Environmental Management (ASCEM) project at Lawrence Berkeley National Laboratory. The goal of the ASCEM project is to better understand and quantify flow and contaminant transport behavior in complex geological systems. The PMAMR code is built upon the BoxLib framework and was previously parallelized using MPI. In this work we added the hybrid MPI/OpenMP programming model to the code to see what kind of performance gains could be achieved.

The figure shows the results of adding OpenMP. Overall, our hybrid approach yields a net speedup of 2.6× compared to the MPI-only version, with substantially reduced memory usage, thus decreasing the time to solution by increasing the rate at which the calculation runs.

OpenMP and precompiled application codes

NERSC provides a large number of precompiled Materials Science and Chemistry application codes. While the majority of them are still flat MPI codes (e.g., VASP), some already have OpenMP directives in the code (e.g., Quantum Espresso). To help address the reduced per-core memory on Hopper, and to take advantage of the improved performance available from OpenMP, we chose two representative density functional theory (DFT) codes, VASP and Quantum Espresso, and analyzed their performance on Hopper when using OpenMP and/or the multi-threaded BLAS library.

VASP:

VASP is an application for performing ab-initio quantum-mechanical molecular dynamics (MD) calculations using pseudopotentials and a plane wave basis set. It is currently the most frequently used DFT code at NERSC. VASP is written in Fortran90 and is parallelized with MPI. The current version of VASP (5.2.11) does not have OpenMP implemented in the code. However, by linking the code to the Cray multi-threaded scientific library (-lsci_mc12_mp) we are able to get some measure of the effects of threading on its performance; this is a low-effort OpenMP implementation. We measured the runtime and the memory usage of the two most commonly used iteration schemes for the electronic iterations, RMM-DIIS and blocked Davidson, with two test cases provided by NERSC users.

Figure 1 shows the time spent on the first five electronic iterations (left) and the average per-core memory used (right) during the execution of VASP (general k-point version). The test system contains 154 atoms (Zn48O48C22S2H34, 4 k-points) and was provided by a NERSC user. The stacked labels on the horizontal axis show the number of threads per MPI task and the corresponding number of MPI tasks. The red and blue bars are the results for the two iterative schemes, RMM-DIIS and blocked Davidson, respectively. All tests were run on 6 nodes (144 cores in total). The green and purple bars at threads = 3 are the results of running the flat MPI code on the same number of nodes but using only 8 cores per node (unpacked runs, threads = 1, MPI tasks = 48) with the RMM-DIIS and Davidson schemes, respectively, for comparison. Note that the flat MPI code using all cores on the node (threads = 1, MPI tasks = 144) failed due to an out-of-memory (OOM) error.

Figure 2 shows the time spent on the first five electronic iterations (left) and the average per-core memory used (right) during execution of VASP (Gamma-point-only version). The test system contains 660 atoms (C200H230N70Na20O120P20) and was provided by a NERSC user. The stacked labels on the horizontal axis show the number of threads per MPI task and the corresponding number of MPI tasks. The red and blue bars are the results for the two iterative schemes, RMM-DIIS and blocked Davidson, respectively. All tests were run on 32 nodes. The green and purple bars at threads = 2 are the results of running the flat MPI code on the same number of nodes but using only 12 cores per node (unpacked runs, threads = 1, MPI tasks = 384) with the RMM-DIIS and Davidson schemes, respectively, for comparison.

The figures above, for both the smaller (154-atom) and larger (660-atom) cases, show that as the number of threads increases for a given number of nodes (with the number of MPI tasks decreasing correspondingly), VASP runs slower; in particular, when the number of threads is larger than 6, VASP suffers a significant performance penalty due to the NUMA effects discussed above. This does not mean, however, that threads do not help at all: they do reduce the memory usage (right-hand figures above). For example, in Figure 1 the flat MPI code ran out of memory on the fully packed nodes (using all 24 cores on each node), while the multi-threaded VASP could run on the same number of nodes using all 24 available cores. Comparing the fully packed runs (red and blue bars at threads = 3) with the unpacked runs (using only 8 of the 24 cores, purple and green bars) in Figure 1, the threaded version runs up to 24% faster than the flat MPI version running on the same number of unpacked nodes.

To summarize, VASP simply does not spend enough time in the threaded BLAS routines to realize a large benefit. However, users who need to run larger-memory VASP jobs that must run on unpacked nodes, so that the cores in use get enough memory, should use the multi-threaded VASP instead of leaving some of the cores idle (your account will be charged for the idle cores as well) and gain the potential performance benefit from using threads. Please refer to our VASP page for how to run the multi-threaded VASP on Hopper.

Quantum Espresso:

Quantum ESPRESSO (QE) is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials (both norm-conserving and ultrasoft). We used QE 4.2.1, which already has OpenMP implemented in the code. It uses multi-threaded BLAS and ScaLAPACK routines from the Cray scientific library (-lsci_mc12_mp, version 10.5.01) and FFT routines from ACML 4.4.0. We measured the runtime and the memory usage of QE with three standard benchmark cases, with slight modifications to the file I/O and the number of k-points.

Figure 3 shows the time spent on the first electronic step (left) and the average per-core memory used (blue bars in the right figure) for the blocked Davidson iteration scheme in QE 4.2.1, with a system containing 686 atoms (C200Ir486). The stacked labels on the horizontal axis show the number of threads per MPI task and the corresponding number of MPI tasks. The purple bar at threads = 2 (left) shows the runtime of the flat MPI code on half-packed nodes (threads = 1, MPI tasks = 720), and the red bar (right figure) shows the average per-core memory used during execution of the flat MPI code on 120 half-packed nodes (threads = 1, MPI tasks = 1440), for comparison. Note that the flat MPI code using all cores on the node (threads = 1, MPI tasks = 1440) failed due to an out-of-memory (OOM) error.

Figure 4 shows the time spent on the first two self-consistent electronic steps (left) and the average per-core memory used (right) in QE. The iteration scheme tested was the damped MD method, with a system containing 1532 atoms (C1164O16N32H320). The stacked labels on the horizontal axis show the number of threads per MPI task and the corresponding number of MPI tasks. All tests were run on 68 nodes. The green bar at threads = 2 shows the runtime of the flat MPI code on half-packed nodes (threads = 1, MPI tasks = 720) for comparison.

Figure 5 shows the time spent on the first two electronic steps (left) and the average per-core memory used (right) for the blocked Davidson iteration scheme in QE, with a system containing 112 Au atoms (AUSRUF112). The stacked labels on the horizontal axis show the number of threads per MPI task and the corresponding number of MPI tasks. All tests were run on 12 nodes. The blue bar at threads = 2 (left) shows the runtime of the flat MPI code on half-packed nodes (threads = 1, MPI tasks = 144) for comparison.

The figures above for all three test cases show that using OpenMP threads improves the code performance by up to 40% at low thread counts (e.g., at threads = 2 running on the same number of fully packed nodes), and that there is a significant memory saving as the number of threads increases. In addition, in all three test cases the hybrid code runs up to 40% faster than the flat MPI version running on the same number of unpacked nodes. As with VASP, when the number of threads is larger than 6, QE also shows a significant performance penalty due to NUMA effects.

To summarize, the QE code does contain OpenMP extensions, and our assessment shows that these produce performance gains of up to 40% compared to the flat MPI version running on the same number of cores. In every case we examined, the best performance was achieved using two OpenMP threads, typically with per-core memory savings of 20 to 40%. We recommend that users run the hybrid QE code in their calculations to get this performance benefit. Also, since the hybrid QE uses less memory than the flat MPI version, users who need to run larger-memory QE jobs that must run on unpacked nodes should use the hybrid QE instead of leaving some of the cores idle (your account will be charged for the idle cores as well); this provides "extra" performance by utilizing cores that would otherwise sit idle. Please refer to our Quantum Espresso page for how to run the hybrid QE on Hopper.