Using OpenMP with MPI
Overview
Adding OpenMP threading to an MPI code is an efficient way to run on multicore processors and nodes like those on Hopper. Because OpenMP operates in a shared memory space, it can reduce the memory overhead associated with MPI tasks and reduce the need for data replicated across tasks. More details on OpenMP (such as the standard and tutorials) can be found at the OpenMP Web Site.
Codes typically use one OpenMP thread per compute core. Therefore the maximum number of threads per node on Hopper is 24. OpenMP performance can be very dependent on the underlying architecture. On Hopper, NERSC has found a "sweet spot" for most codes at 4 MPI tasks per node and 6 OpenMP threads per MPI task. See more information at OpenMP Performance Implications on Hopper.
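A hybrid code is an ordinary MPI program in which each MPI task spawns OpenMP threads for its compute work. The sketch below (a hypothetical hybrid.c, shown only to illustrate the structure) prints one line per rank and thread:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;

    MPI_Init(&argc, &argv);                /* one instance per MPI task (aprun -n) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* each MPI task spawns OMP_NUM_THREADS threads */
    #pragma omp parallel
    printf("Rank %d of %d, thread %d of %d\n",
           rank, nranks, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}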
Compiling to use OpenMP
OpenMP is supported in all of the programming environments available on Franklin and Hopper; however, each compiler suite uses a different command-line option to enable OpenMP, as shown in the table and example below.
| Compiler Suite | Programming Environment module name | Command line option for OpenMP |
|---|---|---|
| Portland Group | PrgEnv-pgi | -mp |
| Pathscale | PrgEnv-pathscale | -mp |
| Cray Compilers | PrgEnv-cray | none needed |
| Intel Compilers (Hopper only) | PrgEnv-intel | -openmp |
| GNU Compilers | PrgEnv-gnu | -fopenmp |
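For example, using the Cray compiler wrappers (cc for C, ftn for Fortran), a hybrid code might be built as follows; the file and executable names are illustrative, and the default PGI environment is assumed to be loaded initially:
cc -mp -o hybrid.x hybrid.c            # PGI syntax (default environment)
module swap PrgEnv-pgi PrgEnv-gnu      # switch to the GNU environment
cc -fopenmp -o hybrid.x hybrid.c       # GNU syntax for the same code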
Running using OpenMP
You must inform the system how you want to run your job:
- How many nodes do you want to use?
- How many total MPI tasks do you want?
- How many MPI tasks do you want to run on each node?
- How many OpenMP threads do you want per MPI task?
- How do you want the MPI tasks distributed among the "NUMA nodes" on Hopper?
- Do you want each OpenMP thread to have access only to the memory pool closest to it (the fastest) on the node?
1. How many nodes?
In your batch script, use the directive:
#PBS -l mppwidth=cores_per_node*number_of_nodes
The value of cores_per_node is 24 for Hopper.
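For example, a job using 4 Hopper nodes (4 * 24 = 96 cores) would request:
#PBS -l mppwidth=96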
If you are running an interactive parallel job, use the following to start your parallel environment
qsub -I -q interactive -V -l mppwidth=cores_per_node*number_of_nodes
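For example, to request 2 Hopper nodes (2 * 24 = 48 cores) interactively:
qsub -I -q interactive -V -l mppwidth=48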
2. How many total MPI tasks? (-n option to aprun)
In your batch script, or in your interactive parallel environment, launch your parallel job with "aprun" and use the -n option, followed by the total number of MPI tasks.
3. How many MPI tasks per compute node? (-N option to aprun)
Use the -N option to aprun to specify the number of MPI tasks to run on each node.
4. How many OpenMP threads per MPI task? (-d option to aprun, OMP_NUM_THREADS variable)
Another aprun option, -d, sets the number of OpenMP threads per MPI task. You must also set the environment variable OMP_NUM_THREADS to this same value. Make sure that this value multiplied by the value for -N does not exceed 24 for Hopper.
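For example, to fill one Hopper node with 4 MPI tasks of 6 threads each (4 * 6 = 24 cores), an executable such as hybrid.x could be launched with (bash syntax):
export OMP_NUM_THREADS=6
aprun -n 4 -N 4 -d 6 ./hybrid.x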
5. MPI task distribution on the node (-S option to aprun, optional)
You'll get better performance if you distribute your MPI tasks among the 4 "NUMA nodes" that make up each Hopper compute node. You do that with the -S option to aprun, which tells the system how many MPI tasks to run on each "NUMA node." Valid values are 1-6.
6. Memory affinity and performance (-ss option to aprun, optional)
Your code will perform better if each OpenMP thread is limited to using the memory closest to it on the node. The -ss option to aprun restricts each thread to the memory nearest its NUMA node. Each thread is thus also limited to using 1/4 of the memory on the node.
7. Was the code compiled with the Pathscale Compiler?
You must set the run-time environment variable "PSC_OMP_AFFINITY" to "FALSE" before the aprun command.
8. Was the code compiled with the Intel Compiler?
You must add one of these two options to the aprun command: "-cc none" or "-cc numa_node".
Putting it all together
Here is a simple batch script for running an MPI/OpenMP code named "hybrid.x" using 4 nodes with the NERSC-recommended 4 MPI tasks per node and 6 OpenMP threads per MPI task. One MPI task is run per NUMA node, and memory use is limited to the memory closest to each OpenMP thread. This example uses the bash shell.
#!/bin/bash -l
#PBS -q regular
#PBS -l mppwidth=96
#PBS -V
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=6
# Pathscale compiler: export PSC_OMP_AFFINITY=FALSE
aprun -n 16 -d 6 -N 4 -S 1 -ss ./hybrid.x
# Intel compiled program: aprun -n 16 -d 6 -N 4 -cc numa_node ./hybrid.x
Sample Batch Scripts
The following example is based on a code at the OpenMP.org site. The code "solves a finite difference discretization of Helmholtz equation... using a Jacobi iterative method."
Below is a sample batch script to run this code (built as the executable "jacobi_mpiomp") using 1 compute node, with 4 MPI tasks per node and 6 OpenMP threads per MPI task, on Hopper. Note that these scripts use csh syntax (setenv) to set OMP_NUM_THREADS.
#PBS -N jacobi
#PBS -q debug
#PBS -l mppwidth=24
#PBS -l walltime=00:10:00
#PBS -e jacobijob.out
#PBS -j eo
#PBS -V
cd $PBS_O_WORKDIR
setenv OMP_NUM_THREADS 6
aprun -n 4 -N 4 -S 1 -ss -d 6 ./jacobi_mpiomp
Below is a similar script that runs the same executable using 2 compute nodes, with 8 MPI tasks in total:
#PBS -N jacobi
#PBS -q debug
#PBS -l mppwidth=48
#PBS -l walltime=00:10:00
#PBS -e jacobijob.out
#PBS -j eo
#PBS -V
cd $PBS_O_WORKDIR
setenv OMP_NUM_THREADS 6
aprun -n 8 -N 4 -S 1 -ss -d 6 ./jacobi_mpiomp
Supported Thread Levels
MPI defines four "levels" of thread safety. The default thread support level for the four programming environments on Hopper (pgi, pathscale, cray, and gnu) is MPI_THREAD_SINGLE, in which only one thread of execution exists. The thread support level actually provided is returned in the "provided" argument of the MPI_Init_thread() call.
You can set the environment variable MPICH_MAX_THREAD_SAFETY to request a higher thread support level, as shown in the table below and in the example that follows it.
| Environment variable MPICH_MAX_THREAD_SAFETY value | Supported Thread Level |
|---|---|
| not set | MPI_THREAD_SINGLE |
| single | MPI_THREAD_SINGLE |
| funneled | MPI_THREAD_FUNNELED |
| serialized | MPI_THREAD_SERIALIZED |
| multiple | MPI_THREAD_MULTIPLE |
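As a sketch (not taken from a NERSC example), a hybrid code that makes MPI calls only from the master thread could request MPI_THREAD_FUNNELED and check what the library grants:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* request FUNNELED: only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED)
        printf("Warning: FUNNELED requested, but only level %d is provided\n", provided);

    /* ... hybrid MPI/OpenMP work ... */

    MPI_Finalize();
    return 0;
}
The matching run-time setting in the batch script would then be "export MPICH_MAX_THREAD_SAFETY=funneled" (or "setenv MPICH_MAX_THREAD_SAFETY funneled" in csh) before the aprun command.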
Further Related Information
OpenMP Performance Implications on Hopper


