Using OpenMP with MPI
Edison nodes contain two processors with 12 cores each, and each processor constitutes a NUMA node. Codes typically use one OpenMP thread per physical compute core; however, hyperthreading (HT) is also possible, allowing two threads per physical compute core. Without HT, the maximum number of OpenMP threads per node on Edison is 24. OpenMP performance can depend strongly on how OpenMP threads are mapped onto the architecture. On Edison, the best choice is most likely 1-4 MPI processes per NUMA node with 12, 6, 4, or 3 OpenMP threads each, respectively. You should experiment with your code to find the best combination of HT, MPI processes, and OpenMP threads. Substituting OpenMP threading for MPI parallelism is an excellent strategy on Edison.
Running with OpenMP
To use OpenMP with MPI, the code must be compiled with OpenMP enabled; please refer to Using OpenMP for more details on programming. You must then request the correct value for mppwidth, modify the aprun command line, and set the OMP_NUM_THREADS environment variable. The key aprun options are as follows (assuming no HyperThreading):
- Total number of MPI tasks (-n)
- Number of MPI tasks per Edison node (-N); maximum of 24
- Number of OpenMP threads (-d); maximum of 24
- Options to control memory and core affinity (-S, -sn, -sl, -cc)
Here are some examples assuming an initial domain decomposition with 192 MPI tasks that is then reduced via an additional level of parallelism for OpenMP threads:
Use #PBS -l mppwidth=192 for all of the following examples:
192 MPI: aprun -n 192 a.out
96 MPI + 2 OpenMP threads per MPI, 1/2 as many MPI tasks per node: aprun -n 96 -N 12 -S 6 -d 2 a.out
48 MPI + 4 OpenMP threads per MPI, 1/4 as many MPI tasks per node: aprun -n 48 -N 6 -S 3 -d 4 a.out
16 MPI + 12 OpenMP threads per MPI, 1/12 as many MPI tasks per node: aprun -n 16 -N 2 -S 1 -d 12 a.out
Please refer to the sample batch scripts for running hybrid MPI/OpenMP jobs on the Edison Example Batch Scripts webpage. Note that different "-cc" options are needed for Intel-compiled programs.
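For illustration, a minimal hybrid batch script for the 48-task example above might look like the following. This is only a sketch: the queue name, walltime, and binary name (a.out) are placeholders, and you should consult the Example Batch Scripts page for tested versions.

```shell
#!/bin/bash
#PBS -q regular
#PBS -l mppwidth=192
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR

# 4 OpenMP threads per MPI task
export OMP_NUM_THREADS=4

# 48 MPI tasks, 6 per node, 3 per NUMA node, 4 threads per task
aprun -n 48 -N 6 -S 3 -d 4 ./a.out
```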
MPI task distribution on the node (-S option to aprun, optional)
You will probably get better performance if you distribute the MPI tasks between the two NUMA nodes within each Edison compute node. Use the aprun -S option, which specifies the number of MPI tasks per NUMA node; valid values are 1-12 on Edison. The default value (without -S) is 12, meaning that aprun packs all MPI tasks onto one NUMA node before using the other.
Memory affinity and performance (-ss option to aprun, optional)
Your code may perform better if each OpenMP thread is limited to using the memory closest to it on the node. The -ss option to aprun restricts each thread to the memory nearest its NUMA node; any thread is thus limited to using only 1/2 the memory on the node. The default behavior (without -ss) is to use local memory first when possible. You can also experiment with the aprun -cc option, which binds computation to hardware cores, using either "-cc none" or "-cc numa_node", especially when using Intel compilers. See the Example Batch Scripts examples and the aprun man pages for more details.
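Combining these affinity options with the 48-task decomposition shown earlier might look like this (a sketch only; a.out is a placeholder for your binary):

```shell
# 48 MPI tasks, 6 per node, 3 per NUMA node, 4 threads per task;
# -ss restricts each task to its local NUMA-node memory,
# -cc numa_node confines each task's threads to its NUMA node's cores
aprun -n 48 -N 6 -S 3 -ss -d 4 -cc numa_node ./a.out
```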
Supported Thread Levels
MPI defines four "levels" of thread safety. The default thread support level for all three programming environments on Edison (Intel, Cray, and GNU) is MPI_THREAD_SINGLE, where only one thread of execution exists. The maximum thread support level is returned by the MPI_Init_thread() call in the "provided" argument.
You can set the environment variable MPICH_MAX_THREAD_SAFETY to increase the thread safety level:

| MPICH_MAX_THREAD_SAFETY value | Supported Thread Level |
| ----------------------------- | ---------------------- |
| single                        | MPI_THREAD_SINGLE      |
| funneled                      | MPI_THREAD_FUNNELED    |
| serialized                    | MPI_THREAD_SERIALIZED  |
| multiple                      | MPI_THREAD_MULTIPLE    |
Using MPI with a fully-threaded code might involve something similar to the following (assuming Intel programming environment):
CC -o test.x TestMPI.cpp -openmp
setenv OMP_NUM_THREADS 6
setenv MPICH_MAX_THREAD_SAFETY multiple
aprun -n 4 -N 4 -S 2 -d 6 test.x
See man intro_mpi for more information, especially regarding performance implications of MPI_THREAD_MULTIPLE.
Nested OpenMP is supported on Babbage. Please see more information on example code and thread affinity control settings here.