Hopper Multi-Core FAQ
Q. How is Hopper different from Franklin?
A. The new Hopper Phase-II system will have 24 cores per node. Franklin had only four.
Q. What else is different?
A. There is less memory per core. Hopper has 1.33 GB / core rather than 2.0 GB / core on Franklin. A code using MPI on Hopper may be more likely to exhaust available memory, causing an error.
Additionally, Hopper's memory hierarchy is "deeper" and more non-uniform than Franklin's and this can have a big impact on performance in certain cases.
Hopper’s 24 cores per node are implemented on two sockets, each containing two six-core dies (see the image below). Each of the six-core dies has direct access to one-quarter of the node's total memory. Thus, although all the node's memory is transparently accessible, the access time from a given core can vary significantly depending on the socket and die on which the memory is located. Optimal programming may need to take this non-uniform memory access (NUMA) architecture into account. In contrast, Franklin’s four cores are on a single die and memory access times are uniform.
Q. Will my existing code run on Hopper?
Probably, yes, your MPI code will run on Hopper. However, the decrease in memory available per core may cause problems. You may see an error message such as “OOM killer terminated this process.” "OOM" means Out of Memory: your code has exhausted the memory available on the node.
Q. Why can MPI-only applications be memory inefficient?
Typically MPI applications require ‘extra’ memory, above what N copies of a serial code would require, for several reasons:
- There is a significant amount of serial code in an MPI application, all of which must be replicated.
- Grid-based codes need memory to hold "ghost cells" for each MPI task.
- Communication buffers are required within the code, for example to store the ‘ghost cells’ in a finite difference algorithm or to provide space for packing and unpacking messages.
- The MPI library itself allocates buffers for temporary storage as a place to store messages that are “in flight.”
Q. What other parallel programming models are available?
A programming model that uses MPI for internode communication and OpenMP for on-node parallelism may reduce a code's memory requirements significantly.
Q. Why does using OpenMP help?
OpenMP allows an application to exploit parallelism using threads rather than processes. On-node CPU cores are utilized to perform parallel work within a shared memory space. Thus the duplication of memory resources that is often required by MPI is no longer needed, as the parallelism is expressed within the same process, and no explicit message passing between threads is required. A second benefit comes from using fewer MPI processes. This means messages are larger, or fewer tasks participate in a collective operation, which can also increase performance.
Q. We tried using OpenMP many years ago on clusters of SMPs but found the payoff to be minimal or negative. What is different now?
First, we reiterate that declining memory per core is a key difference; it limits our choices more than it did in the past.
Although many of the issues are the same as they were in the past, there are several key architectural differences between a Hopper node and the SMPs of days past. For example, the ratio of intra-socket to inter-node MPI latency and bandwidth on a machine like Hopper is some 10-100 times higher than the ratio of intra-processor to inter-processor latency and bandwidth in older SMPs. An MPI-only code simply cannot exploit as effectively the vastly improved latencies and bandwidths available from a chip multiprocessor such as the Hopper Magny-Cours. See below for a more detailed look at OpenMP benefits and drawbacks.
To review: MPI + OpenMP may not be faster than pure MPI but it will almost certainly use less memory.
Q. Why did you purchase a system with only 1.33 GB of memory per core if science applications need more than that?
NERSC configured the system to support its highly diverse workload. The total cost of a system depends on many things, but memory size is a very significant fraction of the total.
Q. Hybrid MPI-OpenMP programming is complicated. Are there other alternatives?
If your code exceeds available memory on Hopper the easiest solution may be to run with fewer active cores per node leaving some cores idle. However, this is inefficient because you may need to use more nodes and your NERSC repo will be charged for all the nodes you use. See the user documentation pages for information on how to run with fewer cores per node.
We believe that MPI + OpenMP is actually the simplest approach to hybrid programming. Other alternatives include MPI plus explicit threading, one-sided communication models, or MPI plus Partitioned Global Address Space (PGAS) programming models such as UPC, or PGAS models alone. These methods require considerably more code rewrite.
Q. What should I expect for the performance of my program on Hopper Phase-2?
There are many factors that contribute to an application’s performance. The Hopper processor clock speed is slightly slower than Franklin's but Hopper’s memory speed is considerably faster. The Hopper Gemini interconnect has lower latency and higher bandwidth than the Seastar on Franklin. Generally, it is likely that for an MPI-only code the per-core performance of Hopper will be roughly the same as Franklin.
Q. Why are MPI + OpenMP codes sometimes slower than MPI alone?
There are four common reasons why this might happen:
- A significant portion of the runtime is spent in code that is not OpenMP parallelized or that contains a serializing construct such as a critical section or an atomic operation.
- The loops that are being parallelized with OpenMP are too small to offset the overhead required to create threads.
- The OpenMP domain is spanning more than one memory domain and is seeing NUMA effects. On Hopper this would correspond to using more than six threads.
- There are data consistency effects that lead to extraneous data movement (false sharing of cache lines).
Q. I keep hearing about the “first touch” principle. What does that mean actually?
On a NUMA system such as Hopper, memory is mapped to the NUMA node (processor die) containing the
core that first touches that memory. Here, "touching" memory simply means writing to it. The important point is that if a thread running on NUMA node 0 references a memory location that has been first touched by a thread running on NUMA nodes 1-3 then the retrieval of this non-local memory will cause a delay. If there are many such remote accesses the code will perform poorly.
There are several ways to improve the performance of threaded applications:
- Have each thread initialize the memory that it will later be processing.
- Always initialize memory immediately after allocating it.
- Initialize memory in parallel regions, not in serial code.
- Constrain your runs to six threads or fewer, because there are six cores per Hopper NUMA node.
Q. What are the benefits and drawbacks associated with OpenMP?
As mentioned above, the principal benefit to using OpenMP is that it can decrease memory requirements, with, in most cases, almost equal performance. There are also several other benefits, including:
- Potential additional parallelization opportunities on top of those exploited by MPI;
- Less domain decomposition, which can help with load balancing as well as allowing for larger messages and fewer tasks participating in MPI collective operations;
- OpenMP is a standard, so any modifications introduced into an application are portable and appear as comments on systems not using OpenMP;
- OpenMP can be added to a code somewhat incrementally, almost on a loop-by-loop basis, by adding annotations to existing code and using a compiler option. If you have a code that vectorizes well the vector loops are good candidates for OpenMP.
There are also some potential drawbacks:
- OpenMP can be hard to program and/or debug in some cases;
- When using more than six threads on Hopper it is very important to pay careful attention to NUMA & locality control;
- If an application is network-bound or memory-bandwidth-bound then threading is not going to help; in this case it may be best to leave some cores idle;
- In some cases a serial portion may be essential, and that can inhibit performance;
- In most MPI codes synchronization is implicit and happens when messages are sent and received but with OpenMP much synchronization must be added to the code explicitly. Parallel scoping, which means determining which variables can be shared among threads and which ones cannot, must also be done explicitly by the programmer. OpenMP codes that have errors introduced by incomplete or misplaced synchronization or improper scoping can be difficult to debug because the error can introduce “race” conditions that cause the error to happen only intermittently.
Q. Is there a typical parallelism scenario for hybrid MPI / OpenMP codes?
A typical scenario is to use MPI for domain decomposition with four or eight MPI processes per node and then use the remaining cores for OpenMP threads within each domain. Frequently, this additional parallelism is at the loop level, but the more computation per thread, the better. The threads belonging to each MPI process carry out their computation until some synchronization point or until they’ve completed. It is important to remember not to use more than six OpenMP threads per NUMA node (24 per node).
Q. I don’t run my own application - I use application software installed by NERSC. What should I do?
NERSC is exploring some of the applications that we provide to determine if they already have OpenMP constructs in the source. If they do we will compile and test them, and provide instructions to users on running them. For other codes we will explore the possibility of adding OpenMP. NERSC’s efforts will probably be devoted to the most heavily used codes. If you encounter issues please contact consult .at. nersc.gov.
Q. This sounds like a very important change in the high performance computing community. What else is NERSC doing to help?
NERSC and Cray have been participating in a joint “Center of Excellence” study over the last year to examine hybrid programming for a few key applications chosen from the NERSC workload. In a nutshell, the study underscored several of the points noted above as well as others, namely: that performance of hybrid codes is often, although not always, better than MPI alone; that memory usage is generally more efficient; that performance often improves up to six OpenMP threads but decreases with 12 because of NUMA effects; and that different portions of codes respond differently, so that individual kernels may be slower with OpenMP but the code overall may be faster.
Q. What kind of programs have been adapted for Hybrid MPI + OpenMP already?
Some examples from the NERSC workload include the Gyrokinetic Toroidal Code (GTC) for particle-in-cell plasma physics; the Parallel Total Energy Code (PARATEC) for electronic structure via the DFT method; the finite-volume Community Atmosphere Model (fvCAM); and the General Atomic and Molecular Electronic Structure System (GAMESS).
Q. What about the future?
The Hopper architecture is an example of what the future will most likely bring: multi-socket nodes with rapidly increasing core counts and decreasing memory per core. These changes are a direct result of what appear to be immutable trends within the semiconductor industry. It is a topic of open debate whether or not MPI will continue to be a viable programming model for such systems. Even if your MPI code runs on Hopper without change, it is a good idea to begin considering alternatives. MPI-only codes may work with 24 cores per node, but with hundreds of cores per node it’s much less likely.
The Hopper processor is an AMD “Magny-Cours” that contains two six-core dies on a single multi-chip module. Each die has two memory paths and four HyperTransport3 links.
Each Hopper node consists of two Magny-Cours processors, 32 or 64 GB of memory, and the Gemini interconnect. Each six-core die in the Magny-Cours processor constitutes a "NUMA node" and there are, therefore, four NUMA nodes per Hopper node. Each NUMA node has local access, via the two memory paths, to one-quarter of the Hopper node memory and remote access to all the rest of the memory.
Processes and Threads
Each MPI rank constitutes a process - an independent program with its own memory (address) space and instruction stream. In most MPI implementations today the number of processes in a program is fixed when the program is launched. Processes can share information only via some inter-process communication mechanism, most commonly MPI, or sometimes TCP/IP sockets. In process-parallel codes with synchronous, blocking communication, most synchronization is implicit and is taken care of by the MPI library during communication operations.
A process may dynamically create one or more independent threads of execution that share the process’s memory (address) space with the parent process and with other threads created by that parent. A thread’s memory space is generally partitioned logically into shared and private regions. Threads share information by reading from or writing to shared memory locations. Although some parallelism in threaded codes is created implicitly (such as in a parallel DO loop), synchronization is generally explicit, and the programmer must also identify to the compiler which data structures are shared among threads and which are private.