Using OpenMP with MPI
Adding OpenMP threading to an MPI code is an efficient way to run on multicore processors. Since OpenMP uses a global shared address space within each node, using OpenMP may reduce memory usage while adding parallelism. It can also reduce time spent in MPI communications. More details on OpenMP (including the standard itself and tutorials) can be found at the OpenMP Web Site. A practical advantage of OpenMP is that you can add it incrementally to an existing code.
Codes typically use one OpenMP thread per physical compute core. Therefore, the maximum number of threads per node on Edison is 16. However, OpenMP performance can be very dependent on the underlying architecture and on the "mapping" of OpenMP threads to the architecture. On Edison, most likely the best choice is 1-4 MPI processes per node with 8-2 OpenMP threads each.
Compiling to use OpenMP
OpenMP is supported in all three programming environments available on Edison; however, each compiler suite has a different syntax for enabling OpenMP.
|Compiler Suite||Programming Environment module name||Command line option for OpenMP|
|Cray Compilers||PrgEnv-cray||none needed; OpenMP is the default|
Running using OpenMP
To use OpenMP with MPI you must: use the correct value for mppwidth; modify the aprun command line; and also set the OMP_NUM_THREADS environment variable. Key aprun options are as follows (assumes no HyperThreading):
- Total number of MPI tasks (-n)
- Number of MPI tasks per Edison node (-N); maximum of 16
- Number of OpenMP threads per MPI task (-d); maximum of 16
- Options to control memory and core affinity (-S, -sn, -sl, -cc)
Here are some examples assuming an initial domain decomposition with 128 MPI tasks that is then reduced via an additional level of parallelism for OpenMP threads:
Use #PBS -l mppwidth=128 for all
128 MPI: aprun -n 128 a.out
64 MPI + 2 OpenMP threads per MPI, 1/2 as many MPI tasks per node: aprun -n 64 -N 8 -S 4 -d 2 a.out
32 MPI + 4 OpenMP threads per MPI, 1/4 as many MPI tasks per node: aprun -n 32 -N 4 -S 2 -d 4 a.out
16 MPI + 8 OpenMP threads per MPI, 1/8 as many MPI tasks per node: aprun -n 16 -N 2 -S 1 -d 8 a.out
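The arithmetic behind these command lines can be sketched as a small shell helper. This is only an illustration: hybrid_aprun_args is a hypothetical function name, not part of aprun, and it assumes the figures used in this section (16 cores per node, 2 NUMA nodes per node, one OpenMP thread per core, no HyperThreading).

```shell
#!/bin/bash
# Sketch: derive aprun flags for a hybrid MPI/OpenMP run.
# hybrid_aprun_args is a hypothetical helper, not an aprun feature.
# Assumed defaults: 16 cores per node, 2 NUMA nodes per node.
hybrid_aprun_args() {
    local total_cores=$1            # value given to mppwidth
    local threads=$2                # OpenMP threads per MPI task (-d)
    local cores_per_node=${3:-16}
    local numa_per_node=${4:-2}

    # The thread count must evenly divide the cores on a node.
    if [ "$threads" -lt 1 ] || [ $((cores_per_node % threads)) -ne 0 ]; then
        echo "threads per task must evenly divide $cores_per_node" >&2
        return 1
    fi

    local tasks=$((total_cores / threads))             # -n
    local tasks_per_node=$((cores_per_node / threads)) # -N
    local args="-n $tasks -N $tasks_per_node"

    # Only add -S when there is at least one task per NUMA node.
    if [ "$tasks_per_node" -ge "$numa_per_node" ]; then
        args="$args -S $((tasks_per_node / numa_per_node))"
    fi
    echo "$args -d $threads"
}

# Print the command lines for 2, 4, and 8 threads per task.
for t in 2 4 8; do
    echo "aprun $(hybrid_aprun_args 128 "$t") a.out"
done
```

With mppwidth=128, the loop prints the flags for 2, 4, and 8 threads per task, so the output can be compared directly against the command lines listed above.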
Please refer to the sample batch scripts for running hybrid MPI/OpenMP jobs on the Edison Example Batch Scripts webpage. Note that different "-cc" options are needed for programs compiled with the Intel compilers.
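As a rough illustration of how mppwidth, OMP_NUM_THREADS, and the aprun options fit together, a minimal batch script for 32 MPI tasks with 4 OpenMP threads each might look like the sketch below. The walltime value, the ./a.out path, and the fallback that prints the command when aprun is not available are assumptions for illustration; the official example scripts remain the reference.

```shell
#!/bin/bash
#PBS -l mppwidth=128
#PBS -l walltime=00:10:00

# The walltime above is an assumed value; adjust it for your job.
# PBS_O_WORKDIR is set by the batch system; fall back to "." elsewhere.
cd "${PBS_O_WORKDIR:-.}"

# 4 OpenMP threads per MPI task; this must agree with aprun -d.
export OMP_NUM_THREADS=4

# 128 cores / 4 threads = 32 MPI tasks in total (-n).
# 16 cores per node / 4 threads = 4 tasks per node (-N), 2 per NUMA node (-S).
launch="aprun -n 32 -N 4 -S 2 -d $OMP_NUM_THREADS ./a.out"

if command -v aprun >/dev/null 2>&1; then
    $launch
else
    # Not on a Cray login or MOM node: just show what would be run.
    echo "$launch"
fi
```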
MPI task distribution on the node (-S option to aprun, optional)
You'll probably get better performance if you distribute the MPI tasks among the two NUMA nodes that constitute each Edison compute node. Use the aprun -S option, which specifies the number of MPI tasks per NUMA node. Valid values are 1-8 for Edison. The default value (without using -S) is 8, meaning that aprun will pack up to 8 MPI tasks onto one NUMA node before using the other.
Memory affinity and performance (-ss option to aprun, optional)
Your code may perform better if each OpenMP thread is limited to using the memory closest to it on the node. The -ss option to aprun restricts each thread to the memory nearest to its NUMA node, so any thread is limited to 1/2 of the memory on the node. The default behavior, without the "-ss" option, is to use local memory first if possible. You can also experiment with the aprun -cc option, which controls how computation is bound to hardware cores, using either "-cc none" or "-cc numa_node", especially when using the Intel compilers. See the Example Batch Scripts examples and the aprun man pages for more details.
Supported Thread Levels
MPI defines four "levels" of thread safety. The default thread support level for all three programming environments on Edison (intel, cray and gnu) is MPI_THREAD_SINGLE, where only one thread of execution exists. The maximum thread support level is returned by the MPI_Init_thread() call in the "provided" argument.
You can set the environment variable MPICH_MAX_THREAD_SAFETY to different values to increase the supported thread safety level.
|Environment variable||Supported Thread Level|
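As a hedged example, a hybrid code in which only the master thread makes MPI calls might raise the level in the batch script before calling aprun. The value name funneled used here is an assumption based on Cray MPICH conventions; check it against the intro_mpi man page on the system before relying on it.

```shell
# Assumed value name (Cray MPICH convention); verify with "man intro_mpi".
export MPICH_MAX_THREAD_SAFETY=funneled

# Threads per MPI task, as in the aprun examples earlier in this section.
export OMP_NUM_THREADS=4

echo "MPICH_MAX_THREAD_SAFETY=$MPICH_MAX_THREAD_SAFETY"
```

The program can then compare the "provided" argument returned by MPI_Init_thread() with the level it requested to confirm that the setting took effect.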