Parallel programming models at NERSC
Since the transition from vector machines to distributed-memory (MPP) supercomputer architectures, the majority of HPC applications deployed on NERSC resources have evolved to use MPI as their sole means of expressing parallelism. As the single-core compute nodes of MPP architectures gave way to multicore processors, applying the same abstraction (processes passing messages) to each available core remained attractive: no code changes were required, and vendors made an effort to design optimized fast paths for on-node communication.
However, as on-node parallelism rapidly increases and competition for shared resources per processing element (memory per core, bandwidth per core, etc.) grows with it, now is a good time to assess whether applications can benefit from a different abstraction for expressing on-node parallelism. Such an abstraction can offer more efficient utilization of resources (e.g. through threading) or the ability to exploit unique architectural features (e.g. vectorization).
Toward Cori and beyond: Performance and portability
The Cori Phase II system, arriving in mid-2016, will continue this trend toward greater intra-node parallelism. The Knights Landing processor will support up to 72 cores per node, each supporting up to four hardware threads and possessing two 512-bit wide vector processing units.
In order to help our users prepare their applications for this new architecture, NERSC has made an effort to provide guidance on parallel programming approaches that we believe will work well on Cori. Chief among these is the combination of MPI for inter-node parallelism and OpenMP for intra-node parallelism (or potentially MPI per NUMA domain with OpenMP within each).
Why MPI + OpenMP?
The reasons we've chosen to suggest this approach to our users are many. Most importantly, it will provide:
- A model that allows application developers to think differently about inter- vs. intra-node parallelism (which will be key to obtaining good performance);
- A "gradual" onramp (in terms of application refactoring / modification) for existing pure-MPI applications to isolate and express intra-node parallelism;
- A convenient (compiler-directive agnostic) way of expressing SIMD parallelism at the loop or function level; and
- A model that could potentially offer portability across a range of new supercomputing architectures characterized by increased intra-node parallelism (especially with the OpenMP device directives introduced in 4.0, and subsequent improvements in 4.5).
We must stress, however, that MPI + OpenMP might not be the right choice for all applications. For example, applications that already make use of a different threading abstraction (e.g. Pthreads, C++11 threads, TBB), use a PGAS programming model for one-sided communication, or use a thread-aware task-based runtime system may already have chosen a model that maps well to intra-node as well as inter-node parallelism.
Indeed, the key point we would like to make is that inter- and intra-node parallelism must be understood and treated differently in order to obtain good performance on many-core architectures like Cori.
Example programming models for Cori
Although not meant to be an exhaustive list, here we briefly examine a number of options that we see as potentially attractive to our users, while considering whether and how each treats inter- and intra-node parallelism differently.
Pure MPI
While pure MPI using the classic two-sided messaging model (the non-RMA components of MPI-2) will indeed work on Cori, it fails to address many of the concerns we raised above. We do not anticipate that this approach will perform well for most applications.
MPI + MPI
With the MPI-3 standard, shared memory programming on-node is possible via MPI's remote memory access (RMA) API, yielding an MPI + MPI model (RMA can also be used off-node). An upside of this approach is that only one library is required for all parallelism. For a program with shared memory parallelism at a very high level, where most data is private by default, this is a powerful model.
MPI + OpenMP
As noted above, OpenMP provides a way to express on-node parallelism (including SIMD) with ease at a relatively fine level. In recent years, overhead due to thread team spin-up, fork/join operations, and thread synchronization has been reduced drastically in common OpenMP runtimes.
MPI + X
While MPI + OpenMP was covered in some detail above, we recognize that other options are possible under an MPI + X approach. For example, one could use a different threading model to express on-node parallelism, such as native C++ concurrency primitives (available since C++11 and likely to improve considerably in C++14 and C++17) or Intel's TBB, as well as data container / execution abstractions like Kokkos.
PGAS and PGAS + X
The partitioned global address space (PGAS) model can be a great fit for applications that benefit from one-sided communication (e.g. distributed data structures supporting fast random access, or low-latency point-to-point communication enabled by RDMA). A global address space enables developers to share data structures both off- and on-node, reducing memory pressure. Further, many PGAS runtimes offer interoperability with other parallel runtimes, such as MPI-IO, or OpenMP for expressing loop-level or other fork/join parallelism constructs, while some (such as Berkeley UPC) go so far as to offer the ability to transparently replace processes with POSIX threads.
Task-based models
Task-based models and abstractions (such as those offered by CHARM++, Legion and HPX, for example) offer many attractive features for mapping computations onto many-core architectures, including light-weight threads, futures and asynchronous execution, and general DAG-based scheduling. Because of their emphasis on application-defined (over-)decomposition and intelligent scheduling of tasks, we anticipate these types of models will see increased interest among application developers simultaneously targeting both many-core and traditional latency-optimized architectures.
Efforts toward next-generation programming models
There are a number of ongoing efforts in the HPC research community to develop new programming systems geared toward future exascale architectures. The DOE X-Stack program in particular is one such centralized effort that includes projects integrating many of the key programming abstractions noted above, such as DAG-based execution and global address space communication models.